Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses and interruptions, as well as diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
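Automatic speech recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexity of real-world conversational settings, where speech is often unstructured and contains disfluencies such as pauses and interruptions, along with diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, that consists of unstructured phone conversations between adults. Our results show that the performance of various state-of-the-art ASR models drops markedly when they are tested in conversational settings. We also observe a correlation between word error rate and the presence of speech disfluencies, which underscores the pressing need for more realistic, conversational ASR benchmarks.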
https://arxiv.org/abs/2409.12042
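The ASR abstract above reports a correlation between Word Error Rate (WER) and disfluencies. As a reminder of what the metric measures, here is a minimal sketch of WER as word-level edit distance divided by reference length. The function name and the example sentences are illustrative assumptions, not taken from the paper, and real evaluations usually normalize text (casing, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a filled pause is inserted and one word is dropped.
wer = word_error_rate("i think we should go now", "i uh think we go now")
print(f"{wer:.3f}")  # prints 0.333 (2 edits over 6 reference words)
```

Because insertions count as errors, a transcript that keeps or hallucinates filler words can raise WER even when every content word is recognized.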
Graph Neural Networks based on the message-passing (MP) mechanism are a dominant approach for handling graph-structured data. However, they are inherently limited to modeling only pairwise interactions, making it difficult to explicitly capture the complexity of systems with $n$-body relations. To address this, topological deep learning has emerged as a promising field for studying and modeling higher-order interactions using various topological domains, such as simplicial and cellular complexes. While these new domains provide powerful representations, they introduce new challenges, such as effectively modeling the interactions among higher-order structures through higher-order MP. Meanwhile, structured state-space sequence models have proven to be effective for sequence modeling and have recently been adapted for graph data by encoding the neighborhood of a node as a sequence, thereby avoiding the MP mechanism. In this work, we propose a novel architecture designed to operate with simplicial complexes, utilizing the Mamba state-space model as its backbone. Our approach generates sequences for the nodes based on the neighboring cells, enabling direct communication between all higher-order structures, regardless of their rank. We extensively validate our model, demonstrating that it achieves competitive performance compared to state-of-the-art models developed for simplicial complexes.
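Graph neural networks based on the message-passing (MP) mechanism are a dominant approach to handling graph-structured data. However, they are inherently limited to modeling pairwise interactions, which makes it difficult to explicitly capture the complexity of systems with $n$-body relations. To address this, topological deep learning has emerged as a promising field that studies and models higher-order interactions using various topological domains, such as simplicial and cellular complexes. Although these new domains provide powerful representations, they introduce new challenges, such as effectively modeling the interactions among higher-order structures through higher-order MP. Meanwhile, structured state-space sequence models have proven effective for sequence modeling and have recently been applied to graph data by encoding the neighborhood of a node as a sequence, thereby avoiding the MP mechanism. In this work, we propose a new architecture designed to operate on simplicial complexes, with the Mamba state-space model as its backbone. Our method generates sequences for the nodes from their neighboring cells, enabling direct communication among all higher-order structures regardless of their rank. We validate the model extensively and show that it performs competitively with state-of-the-art models developed for simplicial complexes.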
https://arxiv.org/abs/2409.12033
The use of data-driven methods in fluid mechanics has surged dramatically in recent years due to their capacity to adapt to the complex and multi-scale nature of turbulent flows, as well as to detect patterns in large-scale simulations or experimental tests. In order to interpret the relationships generated in the models during the training process, numerical attributions need to be assigned to the input features. One important example is the family of additive-feature-attribution methods. These explainability methods link the input features with the model prediction, providing an interpretation based on a linear formulation of the models. The SHapley Additive exPlanations (SHAP values) are formulated as the only possible interpretation that offers a unique solution for understanding the model. In this manuscript, the additive-feature-attribution methods are presented, showing four common implementations in the literature: kernel SHAP, tree SHAP, gradient SHAP, and deep SHAP. Then, the main applications of the additive-feature-attribution methods are introduced, dividing them into three main groups: turbulence modeling, fluid-mechanics fundamentals, and applied problems in fluid dynamics and heat transfer. This review shows that explainability techniques, and in particular additive-feature-attribution methods, are crucial for implementing interpretable and physics-compliant deep-learning models in the fluid-mechanics field.
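In recent years, the use of data-driven methods in fluid mechanics has grown sharply, owing to their ability to adapt to the complex, multi-scale nature of turbulent flows and to detect patterns in large-scale simulations or experimental tests. To interpret the relationships that a model forms during training, numerical attributions must be assigned to the input features. One important example is the family of additive-feature-attribution methods. These explainability methods link the input features to the model prediction and provide an interpretation based on a linear formulation of the model. SHapley Additive exPlanations (SHAP values) are formulated as the only possible interpretation that offers a unique solution for understanding the model. In this manuscript, we present the additive-feature-attribution methods and four implementations that are common in the literature: kernel SHAP, tree SHAP, gradient SHAP, and deep SHAP. We then introduce the main applications of these methods, divided into three groups: turbulence modeling, fluid-mechanics fundamentals, and applied problems in fluid dynamics and heat transfer. This review shows that explainability techniques, and additive-feature-attribution methods in particular, are crucial for implementing interpretable and physics-compliant deep-learning models in fluid mechanics.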
https://arxiv.org/abs/2409.11992
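To make the additive attribution idea in the SHAP abstract above concrete, the sketch below computes exact Shapley values by enumerating every coalition of features. This is the brute-force definition, not kernel, tree, gradient, or deep SHAP, which approximate or accelerate it. The toy model, the input, the zero baseline, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values; cost grows as 2**n_features, so only for tiny n."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i when added to the coalition.
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


# Hypothetical toy model with one interaction term (x0 * x2).
def model(v):
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[2]


x = (1.0, 2.0, 3.0)         # input being explained (assumed)
baseline = (0.0, 0.0, 0.0)  # reference input used for "absent" features (assumed)


def value_fn(coalition):
    # Features outside the coalition are replaced by their baseline value.
    masked = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(masked)


phi = shapley_values(value_fn, 3)
print([round(p, 6) for p in phi])  # prints [3.5, 6.0, 1.5]
```

The attributions sum to model(x) - model(baseline) = 11, which is the additivity property the abstract refers to; the interaction term's contribution of 3 is split equally between features 0 and 2.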
To address the intricate challenges of decentralized cooperative scheduling and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, this paper introduces LMMCoDrive, a novel cooperative driving framework that leverages a Large Multimodal Model (LMM) to enhance traffic efficiency in dynamic urban environments. This framework seamlessly integrates scheduling and motion planning processes to ensure the effective operation of Cooperative Autonomous Vehicles (CAVs). The spatial relationship between CAVs and passenger requests is abstracted into a Bird's-Eye View (BEV) to fully exploit the potential of the LMM. Besides, trajectories are cautiously refined for each CAV while ensuring collision avoidance through safety constraints. A decentralized optimization strategy, facilitated by the Alternating Direction Method of Multipliers (ADMM) within the LMM framework, is proposed to drive the graph evolution of CAVs. Simulation results demonstrate the pivotal role and significant impact of LMM in optimizing CAV scheduling and enhancing decentralized cooperative optimization process for each vehicle. This marks a substantial stride towards achieving practical, efficient, and safe AMoD systems that are poised to revolutionize urban transportation. The code is available at this https URL.
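To address the complex challenges of decentralized cooperative scheduling and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, this paper introduces LMMCoDrive, a novel cooperative driving framework that uses a Large Multimodal Model (LMM) to improve traffic efficiency in dynamic urban environments. The framework integrates scheduling and motion planning to ensure the effective operation of Cooperative Autonomous Vehicles (CAVs). The spatial relationship between CAVs and passenger requests is abstracted into a Bird's-Eye View (BEV) to fully exploit the potential of the LMM. In addition, the trajectory of each CAV is carefully refined while safety constraints ensure collision avoidance. A decentralized optimization strategy, driven by the Alternating Direction Method of Multipliers (ADMM) within the LMM framework, is proposed to drive the graph evolution of the CAVs. Simulation results show that the LMM plays a key role in, and has a significant impact on, optimizing CAV scheduling and improving the decentralized cooperative optimization process of each vehicle. This marks an important step toward practical, efficient, and safe AMoD systems with the potential to transform urban transportation. The code is available at this https URL.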
https://arxiv.org/abs/2409.11981
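The LMMCoDrive abstract above relies on ADMM for decentralized optimization. The paper's actual objective (trajectories with safety constraints) is not given here, so the following is only a toy consensus-ADMM sketch: each agent holds a private scalar target and the agents agree on a shared value by exchanging local estimates. The quadratic local cost, the penalty parameter `rho`, and every name are assumptions.

```python
def consensus_admm(local_targets, rho=1.0, iterations=100):
    """Consensus ADMM for: minimize sum_i 0.5 * (x_i - a_i)**2  s.t.  x_i = z."""
    n = len(local_targets)
    x = [0.0] * n  # local copies, one per agent
    u = [0.0] * n  # scaled dual variables, one per agent
    z = 0.0        # shared consensus variable

    for _ in range(iterations):
        # Local step: each agent solves its own small problem in closed form.
        x = [(a + rho * (z - u_i)) / (1.0 + rho)
             for a, u_i in zip(local_targets, u)]
        # Coordination step: average the agents' proposals.
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [u_i + x_i - z for x_i, u_i in zip(x, u)]
    return z, x


# Hypothetical private targets of three agents; the consensus is their mean.
z, x = consensus_admm([2.0, 4.0, 9.0])
print(round(z, 6))  # prints 5.0
```

Only the local step touches an agent's private data, which is what makes this pattern attractive for decentralized settings: agents share estimates, not their objectives.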
Recent studies suggest a potential link between the physical structure of mitochondria and neurodegenerative diseases. With advances in Electron Microscopy techniques, it has become possible to visualize the boundary and internal membrane structures of mitochondria in detail. It is crucial to automatically segment mitochondria from these images to investigate the relationship between mitochondria and diseases. In this paper, we present a software solution for mitochondrial segmentation, highlighting mitochondria boundaries in electron microscopy tomography images and generating corresponding 3D meshes.
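Recent studies suggest a potential link between the physical structure of mitochondria and neurodegenerative diseases. With advances in electron microscopy techniques, it has become possible to visualize the boundary and internal membrane structures of mitochondria in detail. Automatically segmenting mitochondria from these images is crucial for investigating the relationship between mitochondria and disease. In this paper, we present a software solution for mitochondrial segmentation that highlights mitochondrial boundaries in electron microscopy tomography images and generates the corresponding 3D meshes.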
https://arxiv.org/abs/2409.11974
Understanding the relationships between geometric structures and semantic concepts is crucial for building accurate models of complex environments. Indoors, certain spatial constraints, such as the relative positioning of planes, remain consistent despite variations in layout. This paper explores how these invariant relationships can be captured in a graph SLAM framework by representing high-level concepts like rooms and walls and linking them to geometric elements like planes through an optimizable factor graph. Several efforts have tackled this issue with ad-hoc solutions for each concept generation and with manually-defined factors. This paper proposes a novel method for metric-semantic factor graph generation which includes defining a semantic scene graph, integrating geometric information, and learning the interconnecting factors, all based on Graph Neural Networks (GNNs). An edge classification network (G-GNN) sorts the edges between planes into same room, same wall or none types. The resulting relations are clustered, generating a room or wall for each cluster. A second family of networks (F-GNN) infers the geometrical origin of the new nodes. The definition of the factors employs the same F-GNN used for the metric attribute of the generated nodes. Furthermore, we share the new factor graph with the S-Graphs+ algorithm, extending its graph expressiveness and scene representation with the ultimate goal of improving the SLAM performance. The complexity of the environments is increased to N-plane rooms by training the networks on L-shaped rooms. The framework is evaluated in synthetic and simulated scenarios as no real datasets of the required complex layouts are available.
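Understanding the relationships between geometric structures and semantic concepts is essential for building accurate models of complex environments. In indoor environments, certain spatial constraints, such as the relative positions of planes, stay the same even when the layout varies. This paper explores how these invariant relationships can be captured in a graph SLAM framework by representing high-level concepts such as rooms and walls and linking them to geometric elements such as planes through an optimizable factor graph. Several efforts have addressed this problem with ad-hoc solutions for generating each concept and with manually defined factors. This paper proposes a new method for generating metric-semantic factor graphs, which includes defining a semantic scene graph, integrating geometric information, and learning the interconnecting factors, all based on Graph Neural Networks (GNNs). An edge classification network (G-GNN) classifies the edges between planes as same room, same wall, or none. The resulting relations are clustered, and a room or wall is generated for each cluster. A second family of networks (F-GNN) infers the geometric origin of the new nodes. The factors are defined with the same F-GNN that is used for the metric attribute of the generated nodes. In addition, the new factor graph is shared with the S-Graphs+ algorithm, extending its graph expressiveness and scene representation with the ultimate goal of improving SLAM performance. The complexity of the environments is increased to N-plane rooms by training the networks on L-shaped rooms. The framework is evaluated in synthetic and simulated scenarios because no real datasets with the required complex layouts are available.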
https://arxiv.org/abs/2409.11972
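The factor-graph abstract above describes classifying plane-to-plane edges and then clustering the resulting relations into rooms and walls. The abstract does not name the clustering algorithm, so the sketch below assumes the simplest option, connected components via union-find, applied to edges that a classifier has already labeled. The plane identifiers, labels, and function names are hypothetical.

```python
def cluster_by_relation(nodes, labeled_edges, relation):
    """Group nodes connected by edges carrying the given label (union-find)."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk to the root, compressing the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b, label in labeled_edges:
        if label == relation:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for node in nodes:
        clusters.setdefault(find(node), []).append(node)
    # Keep only clusters with at least two planes; sort for stable output.
    return sorted(sorted(members) for members in clusters.values()
                  if len(members) > 1)


# Hypothetical classifier output for six planes.
planes = ["p1", "p2", "p3", "p4", "p5", "p6"]
edges = [
    ("p1", "p2", "same_room"),
    ("p2", "p3", "same_room"),
    ("p3", "p4", "same_room"),
    ("p4", "p5", "same_wall"),
    ("p5", "p6", "none"),
]
print(cluster_by_relation(planes, edges, "same_room"))  # prints [['p1', 'p2', 'p3', 'p4']]
print(cluster_by_relation(planes, edges, "same_wall"))  # prints [['p4', 'p5']]
```

Each returned cluster would correspond to one generated room or wall node. Connected components are sensitive to a single misclassified edge, so a real system may prefer a more robust clustering step.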
End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms results in diminished development efficacy and impedes the establishment of trust. Pioneering in the issue, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that juxtaposes the module's feature maps against Ground Truth within a unified semantic Representation Space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework lies in the process of feature map encoding and representation aligning, facilitated by our proposed two-stage Alignment AutoEncoder, which ensures the preservation of salient information and the consistency of feature structure. The metric for evaluating the training maturity of functional modules, Similarity Score, demonstrates a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.
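End-to-end models are becoming the mainstream in autonomous driving perception. However, the inability to examine their internal mechanisms in detail reduces development efficiency and hinders the establishment of trust. To address this problem, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a new framework that compares a module's feature maps against the Ground Truth in a unified semantic representation space to quantify their similarity, and thereby assesses the training maturity of individual functional modules. The core of the framework is the process of feature map encoding and representation alignment, carried out by our proposed two-stage Alignment AutoEncoder, which preserves salient information and keeps the feature structure consistent. The metric for the training maturity of functional modules, the Similarity Score, shows a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, which supports the reliability of the framework for evaluation.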
https://arxiv.org/abs/2409.11969
Despite increasing research efforts on household robotics, robots intended for deployment in domestic settings still struggle with more complex tasks such as interacting with functional elements like drawers or light switches, largely due to limited task-specific understanding and interaction capabilities. These tasks require not only detection and pose estimation but also an understanding of the affordances these elements provide. To address these challenges and enhance robotic scene understanding, we introduce SpotLight: A comprehensive framework for robotic interaction with functional elements, specifically light switches. Furthermore, this framework enables robots to improve their environmental understanding through interaction. Leveraging VLM-based affordance prediction to estimate motion primitives for light switch interaction, we achieve up to 84% operation success in real world experiments. We further introduce a specialized dataset containing 715 images as well as a custom detection model for light switch detection. We demonstrate how the framework can facilitate robot learning through physical interaction by having the robot explore the environment and discover previously unknown relationships in a scene graph representation. Lastly, we propose an extension to the framework to accommodate other functional interactions such as swing doors, showcasing its flexibility. Videos and Code: this http URL
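Despite growing research effort on household robotics, robots intended for domestic settings still struggle with more complex tasks, such as interacting with functional elements like drawers or light switches, largely because of their limited task-specific understanding and interaction capabilities. These tasks require not only detection and pose estimation but also an understanding of the affordances these elements provide. To address these challenges and improve robotic scene understanding, we introduce SpotLight, a comprehensive framework for robotic interaction with functional elements, specifically light switches. The framework also enables robots to improve their understanding of the environment through interaction. By using VLM-based affordance prediction to estimate motion primitives for light switch interaction, we achieve up to 84% operation success in real-world experiments. We further introduce a specialized dataset of 715 images and a custom detection model for light switches. We show how the framework can support robot learning through physical interaction by having the robot explore the environment and discover previously unknown relationships in a scene graph representation. Finally, we propose an extension of the framework to other functional interactions, such as swing doors, which demonstrates its flexibility. Videos and Code: this http URL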
https://arxiv.org/abs/2409.11870
The landscape of Deep Learning has experienced a major shift with the pervasive adoption of Transformer-based architectures, particularly in Natural Language Processing (NLP). Novel avenues for physical applications, such as solving Partial Differential Equations and Image Vision, have been explored. However, in challenging domains like robotics, where high non-linearity poses significant challenges, Transformer-based applications are scarce. While Transformers have been used to provide robots with knowledge about high-level tasks, few efforts have been made to perform system identification. This paper proposes a novel methodology to learn a meta-dynamical model of a high-dimensional physical system, such as the Franka robotic arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The objective is to predict quantities of interest (end-effector pose and joint positions) given the torque signals for each joint. This prediction can be useful as a component for Deep Model Predictive Control frameworks in robotics. The meta-model establishes the correlation between torques and positions and predicts the output for the complete trajectory. This work provides empirical evidence of the efficacy of the in-context learning paradigm, suggesting future improvements in learning the dynamics of robotic systems without explicit knowledge of physical parameters. Code, videos, and supplementary materials can be found at project website. See this https URL
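The landscape of deep learning has shifted substantially with the widespread adoption of Transformer-based architectures, particularly in Natural Language Processing (NLP). New avenues for physical applications, such as solving partial differential equations and image vision, have also been explored. However, in challenging domains such as robotics, where strong non-linearity is a major obstacle, Transformer-based applications remain scarce. Although Transformers have been used to give robots knowledge about high-level tasks, few efforts have addressed system identification. This paper proposes a new methodology for learning a meta-dynamical model of a high-dimensional physical system, such as the Franka robotic arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The objective is to predict quantities of interest (end-effector pose and joint positions) from the torque signals of each joint. This prediction can serve as a component of Deep Model Predictive Control frameworks in robotics. The meta-model establishes the correlation between torques and positions and predicts the output for the complete trajectory. This work provides empirical evidence of the effectiveness of the in-context learning paradigm and points to future improvements in learning the dynamics of robotic systems without explicit knowledge of physical parameters. Code, videos, and supplementary materials can be found at the project website. See this https URL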
https://arxiv.org/abs/2409.11815
The event camera has demonstrated significant success across a wide range of areas due to its low time latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in over-fitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work aims to address this gap by introducing a systematic augmentation scheme named EventAug to enrich spatial-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. Our EventAug can facilitate models learning with richer motion patterns, object variants and local spatio-temporal relations, thus improving model robustness to varied moving speeds, occlusions, and action disruptions. Experiment results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be publicly available for this community.
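The event camera has achieved significant success across a wide range of areas thanks to its low latency and high dynamic range. However, the community faces challenges such as insufficient data and limited diversity, which often lead to over-fitting and inadequate feature learning. Notably, data augmentation techniques remain little explored in the event community. This work aims to fill that gap by introducing a systematic augmentation scheme named EventAug to enrich spatial-temporal diversity. Specifically, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, and then introduce the Spatial-salient Event Mask (SSEM) and the Temporal-salient Event Mask (TSEM) to enrich object variants. EventAug helps models learn from richer motion patterns, object variants, and local spatio-temporal relations, and thus improves model robustness to varied moving speeds, occlusions, and action disruptions. Experimental results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be made publicly available to the community.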
https://arxiv.org/abs/2409.11813
Deep trackers have proven successful in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The deep networks employed are usually trained to extract rich knowledge from massive data used in object classification, and so they are capable of representing generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework termed channel distillation to facilitate deep trackers. To validate the effectiveness of channel distillation, we take the discriminative correlation filter (DCF) and ECO as examples. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem to adaptively select informative feature channels that improve the efficacy of tracking moving objects on the fly. Channel distillation can accurately extract good channels, alleviating the influence of noisy channels and generally reducing the number of channels, as well as adaptively generalizing to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.
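Deep trackers have proven successful in visual tracking. Typically, these trackers use optimally pre-trained deep networks to represent all kinds of objects with multi-channel features taken from some fixed layers. The deep networks used are usually trained to extract rich knowledge from the massive data used in object classification, so they can represent generic objects very well. However, these networks are too complex to represent a specific moving object, which leads to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework, termed channel distillation, to assist deep trackers. To validate the effectiveness of channel distillation, we take the discriminative correlation filter (DCF) and ECO as examples. We show that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem that adaptively selects informative feature channels on the fly and improves the tracking of moving objects. Channel distillation can accurately extract good channels, reduce the influence of noisy channels, and generally reduce the number of channels, while adapting to different channels and networks. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.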
https://arxiv.org/abs/2409.11785
Recently, AI systems have made remarkable progress in various tasks. Deep Reinforcement Learning (DRL) is an effective tool for agents to learn policies in low-level state spaces to solve highly complex tasks. Researchers have introduced Intrinsic Motivation (IM) to the RL mechanism, which simulates the agent's curiosity, encouraging agents to explore interesting areas of the environment. This new feature has proved vital in enabling agents to learn policies without being given specific goals. However, even though DRL intelligence emerges through a sub-symbolic model, there is still a need for a sort of abstraction to understand the knowledge collected by the agent. To this end, the classical planning formalism has been used in recent research to explicitly represent the knowledge an autonomous agent acquires and effectively reach extrinsic goals. Although classical planning usually offers limited expressive capabilities, PPDDL has demonstrated its usefulness in reviewing the knowledge gathered by an autonomous system and making causal correlations explicit, and it can be exploited to find a plan to reach any state the agent faces during its experience. This work presents a new architecture implementing an open-ended learning system able to synthesize from scratch its experience into a PPDDL representation and update it over time. Without a predefined set of goals and tasks, the system integrates intrinsic motivations to explore the environment in a self-directed way, exploiting the high-level knowledge acquired during its experience. The system explores the environment and iteratively: (a) discovers options, (b) explores the environment using options, (c) abstracts the knowledge collected, and (d) plans. This paper proposes an alternative approach to implementing open-ended learning architectures exploiting low-level and high-level representations to extend its knowledge in a virtuous loop.
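In recent years, AI systems have made remarkable progress on a variety of tasks. Deep Reinforcement Learning (DRL) is an effective tool that lets agents learn policies in low-level state spaces to solve highly complex tasks. Researchers have introduced Intrinsic Motivation (IM) into the reinforcement learning (RL) mechanism to simulate the agent's curiosity and encourage it to explore interesting areas of the environment. This feature has proved vital in enabling agents to learn policies without being given specific goals. However, even though DRL intelligence emerges through a sub-symbolic model, some form of abstraction is still needed to understand the knowledge the agent collects. To this end, recent research has used the classical planning formalism to explicitly represent the knowledge an autonomous agent acquires and to reach extrinsic goals effectively. Although classical planning usually has limited expressive power, PPDDL has proved useful for reviewing the knowledge gathered by an autonomous system and making causal correlations explicit, and it can be used to find a plan to reach any state the agent encounters during its experience. This work presents a new architecture that implements an open-ended learning system able to synthesize its experience from scratch into a PPDDL representation and update it over time. Without a predefined set of goals and tasks, the system uses intrinsic motivations to explore the environment in a self-directed way, exploiting the high-level knowledge acquired during its experience. The system explores the environment and iteratively (a) discovers options, (b) explores the environment using options, (c) abstracts the knowledge collected, and (d) plans. This paper proposes an alternative approach to implementing open-ended learning architectures that exploits low-level and high-level representations to extend the system's knowledge in a virtuous loop.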
https://arxiv.org/abs/2409.11756
Autism Spectrum Disorder (ASD) significantly affects the social and communication abilities of children, and eye-tracking is commonly used as a diagnostic tool by identifying associated atypical gaze patterns. Traditional methods demand manual identification of Areas of Interest in gaze patterns, lowering the performance of gaze behavior analysis in ASD subjects. To tackle this limitation, we propose a novel method to automatically analyze gaze behaviors in ASD children with superior accuracy. To be specific, we first apply and optimize seven clustering algorithms to automatically group gaze points to compare ASD subjects with typically developing peers. Subsequently, we extract 63 significant features to fully describe the patterns. These features can describe correlations between ASD diagnosis and gaze patterns. Lastly, using these features as prior knowledge, we train multiple predictive machine learning models to predict and diagnose ASD based on their gaze behaviors. To evaluate our method, we apply our method to three ASD datasets. The experimental and visualization results demonstrate the improvements of clustering algorithms in the analysis of unique gaze patterns in ASD children. Additionally, these predictive machine learning models achieved state-of-the-art prediction performance ($81\%$ AUC) in the field of automatically constructed gaze point features for ASD diagnosis. Our code is available at \url{this https URL}.
https://arxiv.org/abs/2409.11744
Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and do a variety of other \emph{affective cognition}. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are ``superhuman'' -- they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
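Understanding emotions is fundamental to human interaction and experience. Humans readily infer emotions from situations or facial expressions, infer situations from emotions, and perform many other kinds of affective cognition. How good is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios that explore the relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show that foundation models tend to agree with human intuitions, matching or exceeding inter-participant agreement. In some conditions the models are "superhuman": they predict the modal human judgment better than the average human does. All models benefit from chain-of-thought reasoning. This suggests that foundation models have acquired a human-like understanding of emotions and of their influence on beliefs and behavior.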
https://arxiv.org/abs/2409.11733
The Light-Field (LF) image is an emerging form of 4D light-ray data that can realistically present the spatial and angular information of a 3D scene. However, the large data volume of LF images becomes the most challenging issue in real-time processing, transmission, and storage. In this paper, we propose an end-to-end deep LF Image Compression method Using Disentangled Representation and Asymmetrical Strip Convolution (LFIC-DRASC) to improve coding efficiency. Firstly, we formulate the LF image compression problem as learning a disentangled LF representation network and an image encoding-decoding network. Secondly, we propose two novel feature extractors that leverage the structural prior of LF data by integrating features across different dimensions. Meanwhile, a disentangled LF representation network is proposed to enhance the LF feature disentangling and decoupling. Thirdly, we propose the LFIC-DRASC for LF image compression, where two Asymmetrical Strip Convolution (ASC) operators, i.e. horizontal and vertical, are proposed to capture long-range correlation in LF feature space. These two ASC operators can be combined with the square convolution to further decouple LF features, which enhances the model's ability to represent intricate spatial relationships. Experimental results demonstrate that the proposed LFIC-DRASC achieves an average bit rate reduction of 20.5\% compared with the state-of-the-art methods.
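A light-field (LF) image is a form of 4D light-ray data that can realistically present the spatial and angular information of a 3D scene. However, the large data volume of LF images is the most challenging issue in real-time processing, transmission, and storage. In this paper, we propose an end-to-end deep LF image compression method using disentangled representation and asymmetrical strip convolution (LFIC-DRASC) to improve coding efficiency. First, we formulate LF image compression as learning a disentangled LF representation network and an image encoding-decoding network. Second, we propose two new feature extractors that exploit the structural prior of LF data by integrating features across different dimensions. In addition, a disentangled LF representation network is proposed to strengthen the disentangling and decoupling of LF features. Third, we propose LFIC-DRASC for LF image compression, in which two asymmetrical strip convolution (ASC) operators, horizontal and vertical, capture long-range correlation in the LF feature space. These two ASC operators can be combined with square convolution to further decouple LF features, which improves the model's ability to represent intricate spatial relationships. Experimental results show that the proposed LFIC-DRASC achieves an average bit rate reduction of 20.5% compared with state-of-the-art methods.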
https://arxiv.org/abs/2409.11711
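The LFIC-DRASC abstract above uses horizontal and vertical strip convolutions to capture long-range correlation. The sketch below shows only the kernel shapes involved: a 1 x k kernel that spans a row and a k x 1 kernel that spans a column, applied with a plain "valid" sliding window. The averaging weights, the kernel length, and the toy image are assumptions; in the paper the weights would be learned and the operators sit inside a larger network.

```python
def conv2d_valid(image, kernel):
    """Sliding-window cross-correlation without padding (as in deep-learning layers)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def strip_kernels(length):
    """Return (horizontal 1 x length, vertical length x 1) averaging kernels."""
    weight = 1.0 / length
    horizontal = [[weight] * length]
    vertical = [[weight] for _ in range(length)]
    return horizontal, vertical


# Toy 4 x 5 image whose value equals its column index.
image = [[float(col) for col in range(5)] for _ in range(4)]
horizontal, vertical = strip_kernels(3)

h_out = conv2d_valid(image, horizontal)  # 4 rows x 3 columns
v_out = conv2d_valid(image, vertical)    # 2 rows x 5 columns
print(len(h_out), len(h_out[0]), len(v_out), len(v_out[0]))  # prints 4 3 2 5
```

A strip kernel of length k covers k positions along one axis with k weights, whereas a square k x k kernel needs k*k weights for the same reach, which is the usual motivation for strip-shaped operators.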
Pose skeleton images are an important reference in pose-controllable image generation. In order to enrich the source of skeleton images, recent works have investigated the generation of pose skeletons based on natural language. These methods are based on GANs. However, it remains challenging to perform diverse, structurally correct and aesthetically pleasing human pose skeleton generation with various textual inputs. To address this problem, we propose a framework with GUNet as the main model, PoseDiffusion. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates several desired properties that outperform existing methods. 1) Correct Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to incorporate graphical convolutional neural networks. It is able to learn the spatial relationships of the human skeleton by introducing skeletal information during the training process. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.
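Pose skeleton images are an important reference in pose-controllable image generation. To enrich the sources of skeleton images, recent works have investigated generating pose skeletons from natural language. These methods are based on GANs. However, it remains challenging to generate diverse, structurally correct, and aesthetically pleasing human pose skeletons from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first generative framework based on a diffusion model, and it also includes a series of variants fine-tuned from a stable diffusion model. PoseDiffusion shows several desirable properties that outperform existing methods. 1) Correct skeletons. GUNet, the denoising model of PoseDiffusion, is designed to incorporate graphical convolutional neural networks, and it learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity. We decouple the key points of the skeleton, characterize them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in the stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.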
https://arxiv.org/abs/2409.11689
In this paper, we propose SRIF, a novel Semantic shape Registration framework based on diffusion-based Image morphing and Flow estimation. More concretely, given a pair of extrinsically aligned shapes, we first render them from multi-views, and then utilize an image interpolation framework based on diffusion models to generate sequences of intermediate images between them. The images are later fed into a dynamic 3D Gaussian splatting framework, with which we reconstruct and post-process for intermediate point clouds respecting the image morphing processing. In the end, tailored for the above, we propose a novel registration module to estimate continuous normalizing flow, which deforms source shape consistently towards the target, with intermediate point clouds as weak guidance. Our key insight is to leverage large vision models (LVMs) to associate shapes and therefore obtain much richer semantic information on the relationship between shapes than the ad-hoc feature extraction and alignment. As a consequence, SRIF not only achieves high-quality dense correspondences on challenging shape pairs, but also delivers smooth, semantically meaningful interpolation in between. Empirical evidence justifies the effectiveness and superiority of our method as well as specific design choices. The code is released at this https URL.
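In this paper, we propose SRIF, a novel semantic shape registration framework based on diffusion-based image morphing and flow estimation. More concretely, given a pair of extrinsically aligned shapes, we first render them from multiple views and then use an image interpolation framework based on diffusion models to generate sequences of intermediate images between them. The images are then fed into a dynamic 3D Gaussian splatting framework, with which we reconstruct and post-process intermediate point clouds that follow the image morphing process. Finally, tailored to the above, we propose a novel registration module that estimates a continuous normalizing flow, which deforms the source shape consistently toward the target with the intermediate point clouds as weak guidance. Our key insight is to use large vision models (LVMs) to associate shapes and thereby obtain much richer semantic information about the relationship between shapes than ad-hoc feature extraction and alignment provide. As a result, SRIF not only achieves high-quality dense correspondences on challenging shape pairs but also delivers smooth, semantically meaningful interpolation in between. Empirical evidence supports the effectiveness and superiority of our method as well as its specific design choices. The code is released at this https URL.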
https://arxiv.org/abs/2409.11682
The intricate nature of real-world driving environments, characterized by dynamic and diverse interactions among multiple vehicles and their possible future states, presents considerable challenges in accurately predicting the motion states of vehicles and handling the uncertainty inherent in the predictions. Addressing these challenges requires comprehensive modeling and reasoning to capture the implicit relations among vehicles and the corresponding diverse behaviors. This research introduces an integrated framework for autonomous vehicles (AVs) motion prediction to address these complexities, utilizing a novel Relational Hypergraph Interaction-informed Neural mOtion generator (RHINO). RHINO leverages hypergraph-based relational reasoning by integrating a multi-scale hypergraph neural network to model group-wise interactions among multiple vehicles and their multi-modal driving behaviors, thereby enhancing motion prediction accuracy and reliability. Experimental validation using real-world datasets demonstrates the superior performance of this framework in improving predictive accuracy and fostering socially aware automated driving in dynamic traffic scenarios.
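Real-world driving environments are complex, characterized by dynamic and diverse interactions among multiple vehicles and their possible future states, which makes it highly challenging to predict vehicle motion states accurately and to handle the uncertainty in those predictions. Meeting these challenges requires comprehensive modeling and reasoning to capture the implicit relations among vehicles and the corresponding diverse behaviors. This study introduces an integrated framework for autonomous vehicle (AV) motion prediction that uses a novel Relational Hypergraph Interaction-informed Neural mOtion generator (RHINO). RHINO performs hypergraph-based relational reasoning by integrating a multi-scale hypergraph neural network that models group-wise interactions among multiple vehicles and their multi-modal driving behaviors, improving the accuracy and reliability of motion prediction. Experiments on real-world datasets demonstrate the framework's superior performance in improving predictive accuracy and in fostering socially aware automated driving in dynamic traffic scenarios.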
https://arxiv.org/abs/2409.11676
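As a rough illustration of what group-wise interaction on a hypergraph means, the sketch below runs one round of generic two-stage message passing: each hyperedge averages the features of its member vehicles, and each vehicle then averages the features of the hyperedges it belongs to. This is a textbook-style simplification and not RHINO's multi-scale architecture; the function name, the mean aggregation, and the toy vehicle features are all assumptions made for the example.

```python
def hypergraph_message_pass(node_features, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation.

    node_features: dict mapping node id to a feature list.
    hyperedges: list of tuples of node ids (one tuple per group).
    Nodes that belong to no hyperedge keep their original features.
    """
    dim = len(next(iter(node_features.values())))

    def mean(vectors):
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    # Stage 1: each hyperedge summarizes its whole group at once,
    # rather than one pair of nodes at a time.
    edge_features = [mean([node_features[n] for n in edge]) for edge in hyperedges]

    # Stage 2: each node gathers the summaries of its groups.
    updated = {}
    for node, feat in node_features.items():
        incident = [edge_features[k] for k, edge in enumerate(hyperedges) if node in edge]
        updated[node] = mean(incident) if incident else list(feat)
    return updated


if __name__ == "__main__":
    # Hypothetical one-dimensional features (e.g. speed) for four vehicles.
    speeds = {"a": [0.0], "b": [2.0], "c": [4.0], "d": [10.0]}
    groups = [("a", "b", "c")]  # one three-vehicle interaction group
    print(hypergraph_message_pass(speeds, groups))
```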
Unified information extraction (UIE) aims to complete all information extraction tasks using a single model or framework. While previous work has primarily focused on instruction-tuning large language models (LLMs) with constructed datasets, these methods require significant computational resources and struggle to generalize to unseen tasks. To address these limitations, we propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning to enable rapid generalization while reducing computational costs. The key challenge in RUIE is selecting the most beneficial demonstrations for LLMs to effectively handle diverse IE tasks. To achieve this, we integrate LLM preferences for ranking candidate demonstrations and design a keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations. We then train a bi-encoder retriever for UIE through contrastive learning and knowledge distillation. To the best of our knowledge, RUIE is the first trainable retrieval framework for UIE. Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in generalizing to unseen tasks, with average F1-score improvements of 19.22 and 3.13 compared to instruction-tuning methods and other retrievers, respectively. Further analysis confirms RUIE's adaptability to LLMs of varying sizes and the importance of its key components.
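Unified information extraction (UIE) aims to complete all information extraction tasks with a single model or framework. Previous work has mainly instruction-tuned large language models (LLMs) on constructed datasets, but these methods need substantial computational resources and generalize poorly to unseen tasks. To overcome these limitations, we propose RUIE (Retrieval-based Unified Information Extraction), a framework that uses in-context learning to enable rapid generalization while lowering computational cost. The key challenge in RUIE is selecting the demonstrations that most help the LLM handle diverse IE tasks effectively. To this end, we incorporate LLM preferences when ranking candidate demonstrations and design a keyword-enhanced reward model that captures fine-grained relationships between queries and demonstrations. We then train a bi-encoder retriever for UIE through contrastive learning and knowledge distillation. To the best of our knowledge, RUIE is the first trainable retrieval framework for UIE. Experiments on 8 held-out datasets show that RUIE generalizes effectively to unseen tasks, with average F1-score improvements of 19.22 and 3.13 over instruction-tuning methods and other retrievers, respectively. Further analysis confirms RUIE's adaptability to LLMs of different sizes and the importance of its key components.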
https://arxiv.org/abs/2409.11673
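The abstract says the bi-encoder retriever is trained with contrastive learning but does not state the loss. A common choice for this kind of training is the InfoNCE objective, sketched below under that assumption: the query embedding is scored against one positive demonstration and several negatives by dot product, and the loss is the negative log-probability of the positive under a softmax. The function names, the `temperature` parameter, and the toy two-dimensional embeddings are illustrative and are not taken from RUIE.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query, positive, negatives, temperature=1.0):
    """Negative log-softmax probability of the positive demonstration.

    query, positive: embedding vectors (lists of floats).
    negatives: list of embedding vectors for non-matching demonstrations.
    """
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    scaled = [s / temperature for s in scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[0]


if __name__ == "__main__":
    query = [1.0, 0.0]
    good = info_nce_loss(query, [1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]])
    bad = info_nce_loss(query, [0.0, 1.0], [[1.0, 0.0], [1.0, 0.0]])
    print(good, bad)  # the loss is lower when the positive is closest to the query
```

Minimizing this loss pulls the query embedding towards demonstrations judged useful and pushes it away from the rest, which is what lets the retriever rank candidates at inference time with a single dot product per candidate.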
Many language models now enhance their responses with retrieval capabilities, leading to the widespread adoption of retrieval-augmented generation (RAG) systems. However, despite retrieval being a core component of RAG, much of the research in this area overlooks the extensive body of work on fair ranking, neglecting the importance of considering all stakeholders involved. This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus specifically on measuring the fair exposure of each relevant item across the rankings utilized by RAG systems (i.e., item-side fairness), aiming to promote equitable growth for relevant item providers. To gain a deep understanding of the relationship between item-fairness, ranking quality, and generation quality in the context of RAG, we analyze nine different RAG systems that incorporate fair rankings across seven distinct datasets. Our findings indicate that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems, despite the general trend of a tradeoff between ensuring fairness and maintaining system-effectiveness. We believe our insights lay the groundwork for responsible and equitable RAG systems and open new avenues for future research. We publicly release our codebase and dataset at this https URL.
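Many language models now augment their responses with retrieval, which has led to the widespread adoption of retrieval-augmented generation (RAG) systems. Yet although retrieval is a core component of RAG, much of the research in this area overlooks the extensive literature on fair ranking and neglects the importance of considering all stakeholders. This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus on measuring the fair exposure of each relevant item across the rankings used by RAG systems (i.e., item-side fairness), with the aim of promoting equitable growth for providers of relevant items. To understand the relationship between item fairness, ranking quality, and generation quality in RAG, we analyze nine different RAG systems that incorporate fair rankings across seven distinct datasets. Our findings show that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems, despite the general trend of a tradeoff between ensuring fairness and maintaining system effectiveness. We believe these insights lay the groundwork for responsible and equitable RAG systems and open new avenues for future research. The codebase and dataset are publicly released at this https URL.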
https://arxiv.org/abs/2409.11598
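One way to picture item-side exposure is to average, over the rankings a system produces, how much attention each item receives at its position. The sketch below does this under two assumptions that are not stated in the abstract: a logarithmic position discount of 1 / log2(rank + 1), and a simple max-minus-min gap among relevant items as the unfairness measure. The function names are hypothetical.

```python
import math
from collections import defaultdict


def expected_exposure(rankings):
    """Average position-discounted exposure of each item over several rankings."""
    totals = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            totals[item] += 1.0 / math.log2(rank + 1)
    return {item: total / len(rankings) for item, total in totals.items()}


def exposure_gap(rankings, relevant_items):
    """Largest difference in expected exposure among the relevant items.

    A gap of 0 means every relevant item receives the same expected exposure.
    """
    exposure = expected_exposure(rankings)
    values = [exposure.get(item, 0.0) for item in relevant_items]
    return max(values) - min(values)


if __name__ == "__main__":
    relevant = ["a", "b"]
    # A single fixed ranking always favors "a".
    print(exposure_gap([["a", "b"]], relevant))
    # Alternating between the two orders equalizes exposure.
    print(exposure_gap([["a", "b"], ["b", "a"]], relevant))
```

The second call illustrates why fair ranking is often stochastic: when two items are equally relevant, no single ordering can give them equal exposure, but a distribution over orderings can.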