Robotic affordances, which provide information about what actions can be taken in a given situation, can aid robotic manipulation. However, learning affordances typically requires large, expensive annotated datasets of interactions or demonstrations. In this work, we argue that well-directed interactions with the environment can mitigate this problem, and we propose an information-based measure that augments the agent's objective and accelerates the affordance discovery process. We provide a theoretical justification for our approach and validate it empirically on both simulated and real-world tasks. Our method, which we dub IDA, enables the efficient discovery of visual affordances for several action primitives, such as grasping, stacking objects, and opening drawers, strongly improving data efficiency in simulation, and it allows us to learn grasping affordances in a small number of interactions on a real-world setup with a UFACTORY XArm 6 robot arm.
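As one concrete (hypothetical) instance of an information-based exploration bonus, an agent could maintain a Beta posterior over the success probability of a primitive at each discretized location and probe wherever the posterior is still uncertain. This is a sketch for illustration, not IDA's actual measure; the class and cell names are invented:

```python
class BetaAffordance:
    """Beta posterior over the success probability of an action
    primitive (e.g. grasping) at one discretized location."""

    def __init__(self):
        self.alpha = 1.0  # successes + 1 (uniform prior)
        self.beta = 1.0   # failures + 1

    def update(self, success):
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def info_bonus(self):
        # Posterior variance: large where the affordance is still
        # uncertain, steering the agent toward informative interactions.
        a, b = self.alpha, self.beta
        return a * b / ((a + b) ** 2 * (a + b + 1.0))


cells = {loc: BetaAffordance() for loc in ("drawer", "cube", "table")}
cells["cube"].update(True)
cells["cube"].update(True)
```

After two observed grasps on the cube, its bonus drops below that of the untouched cells, so an objective augmented with this bonus would push the agent toward the drawer and the table instead.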
https://arxiv.org/abs/2405.03865
This paper presents a framework for learning state and action abstractions in sequential decision-making domains. Our framework, planning abstraction from language (PARL), utilizes language-annotated demonstrations to automatically discover a symbolic and abstract action space and induce a latent state abstraction based on it. PARL consists of three stages: 1) recovering object-level and action concepts, 2) learning state abstractions, abstract action feasibility, and transition models, and 3) applying low-level policies for abstract actions. During inference, given the task description, PARL first makes abstract action plans using the latent transition and feasibility functions, then refines the high-level plan using low-level policies. PARL generalizes across scenarios involving novel object instances and environments, unseen concept compositions, and tasks that require longer planning horizons than settings it is trained on.
https://arxiv.org/abs/2405.03864
Automatic personality trait assessment is essential for high-quality human-machine interactions. Systems capable of human behavior analysis could be used for self-driving cars, medical research, and surveillance, among many others. We present a multimodal deep neural network with a Siamese extension for apparent personality trait prediction, trained on short video recordings and exploiting modality-invariant embeddings. Acoustic, visual, and textual information are utilized to reach high-performance solutions in this task. Because the target distribution of the analyzed dataset is highly concentrated, even changes in the third decimal place are relevant. Our proposed method addresses the challenge of under-represented extreme values, achieves a 0.0033 average MAE improvement, and shows a clear advantage over the baseline multimodal DNN without the introduced module.
https://arxiv.org/abs/2405.03846
Previous studies have demonstrated that proactive interaction with user reviews has a positive impact on the perception of app users and encourages them to submit revised ratings. Nevertheless, developers encounter challenges in managing a high volume of reviews, particularly for popular apps with a substantial influx of daily reviews. Consequently, there is a demand for automated solutions that streamline the process of responding to user reviews. To address this, we have developed a new system for generating automatic responses by leveraging user-contributed documents with the help of retrieval-augmented generation (RAG) and advanced Large Language Models (LLMs). Our solution, named SCRABLE, is an adaptive customer review response automation that enhances itself with self-optimizing prompts and an LLM-based judging mechanism. Additionally, we introduce an automatic scoring mechanism that mimics the role of a human evaluator to assess the quality of responses generated in customer review domains. Extensive experiments and analyses conducted on real-world datasets reveal that our method is effective in producing high-quality responses, yielding an improvement of more than 8.5% compared to the baseline. Further validation through manual examination of the generated responses underscores the efficacy of our proposed system.
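The retrieval step of such a pipeline can be pictured with a toy word-overlap ranker over user-contributed documents. This is an illustrative stand-in, not SCRABLE's actual retriever; real RAG systems rank with dense embeddings, and all names and documents below are invented:

```python
def retrieve(query, docs, k=2):
    """Rank user-contributed documents by word overlap with the review.
    A toy stand-in for the retrieval step of a RAG pipeline."""
    q = set(query.lower().split())
    return sorted(
        docs,
        key=lambda d: len(q & set(d.lower().split())),
        reverse=True,
    )[:k]


docs = [
    "app crashes on login",
    "great battery life",
    "login crash after update",
]
top = retrieve("crash when i login", docs)
```

The top-ranked documents would then be placed in the LLM prompt as context for drafting the response.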
https://arxiv.org/abs/2405.03845
Recent developments in Large Language Models (LLMs) have significantly expanded their applications across various domains. However, the effectiveness of LLMs is often constrained when operating individually in complex environments. This paper introduces a transformative approach by organizing LLMs into community-based structures, aimed at enhancing their collective intelligence and problem-solving capabilities. We investigate different organizational models (hierarchical, flat, dynamic, and federated), each presenting unique benefits and challenges for collaborative AI systems. Within these structured communities, LLMs are designed to specialize in distinct cognitive tasks, employ advanced interaction mechanisms such as direct communication, voting systems, and market-based approaches, and dynamically adjust their governance structures to meet changing demands. The implementation of such communities holds substantial promise for improving problem-solving capabilities in AI, prompting an in-depth examination of their ethical considerations, management strategies, and scalability potential. This position paper seeks to lay the groundwork for future research, advocating a paradigm shift from isolated to synergistic operational frameworks in AI research and application.
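Of the interaction mechanisms listed, voting is the simplest to sketch: independent answers from community members are aggregated by majority. This toy aggregator is an assumption for illustration, not a mechanism specified by the paper:

```python
from collections import Counter


def community_vote(answers):
    """Aggregate independent answers from community members by
    majority vote; ties break toward the earliest answer seen."""
    return Counter(answers).most_common(1)[0][0]


consensus = community_vote(["Paris", "Paris", "Lyon"])
```

Richer mechanisms (weighted votes, market-based bids) would replace the plain count with per-agent weights.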
https://arxiv.org/abs/2405.03825
Everyday devices like light bulbs and kitchen appliances are now embedded with so many features and automated behaviors that they have become complicated to actually use. While such "smart" capabilities can better support users' goals, the task of learning the "ins and outs" of different devices is daunting. Voice assistants aim to solve this problem by providing a natural language interface to devices, yet such assistants cannot understand loosely-constrained commands, they lack the ability to reason about and explain devices' behaviors to users, and they rely on connectivity to intrusive cloud infrastructure. Toward addressing these issues, we propose thoughtful things: devices that leverage lightweight, on-device language models to take actions and explain their behaviors in response to unconstrained user commands. We propose an end-to-end framework that leverages formal modeling, automated training data synthesis, and generative language models to create devices that are both capable and thoughtful in the presence of unconstrained user goals and inquiries. Our framework requires no labeled data and can be deployed on-device, with no cloud dependency. We implement two thoughtful things (a lamp and a thermostat) and deploy them on real hardware, evaluating their practical performance.
https://arxiv.org/abs/2405.03821
Accurate trajectory prediction is crucial for ensuring safe and efficient autonomous driving. However, most existing methods overlook complex interactions between traffic participants that often govern their future trajectories. In this paper, we propose SocialFormer, an agent interaction-aware trajectory prediction method that leverages the semantic relationship between the target vehicle and surrounding vehicles by making use of the road topology. We also introduce an edge-enhanced heterogeneous graph transformer (EHGT) as the aggregator in a graph neural network (GNN) to encode the semantic and spatial agent interaction information. Additionally, we introduce a temporal encoder based on gated recurrent units (GRU) to model the temporal social behavior of agent movements. Finally, we present an information fusion framework that integrates agent encoding, lane encoding, and agent interaction encoding for a holistic representation of the traffic scene. We evaluate SocialFormer for the trajectory prediction task on the popular nuScenes benchmark and achieve state-of-the-art performance.
https://arxiv.org/abs/2405.03809
Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, as well as the robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: this https URL.
https://arxiv.org/abs/2405.03690
We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization. Despite its simplicity, our method produces surprisingly compelling pose reconstructions of people in close contact, correctly capturing the semantics of the social and physical interactions. We demonstrate that our method rivals more complex state-of-the-art approaches that require expensive human annotation of contact points and training specialized models. Moreover, unlike previous approaches, our method provides a unified framework for resolving self-contact and person-to-person contact.
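One way to picture turning a language descriptor into a tractable loss: a phrase naming two body parts in contact becomes a squared-distance penalty between the corresponding joints. The joint names, coordinates, and function below are invented for illustration and are not the paper's actual loss formulation:

```python
# Hypothetical 3D joint positions from a pose estimate; the joint
# names and coordinate values are invented for this example.
pose = {
    "left_hand": (0.30, 1.10, 0.20),
    "right_shoulder": (0.35, 1.12, 0.22),
}


def contact_loss(pose, joint_a, joint_b):
    """Squared distance between two joints that a language descriptor
    (e.g. 'the left hand rests on the right shoulder') says should
    touch; minimizing it pulls the optimized pose into contact."""
    pa, pb = pose[joint_a], pose[joint_b]
    return sum((a - b) ** 2 for a, b in zip(pa, pb))


loss = contact_loss(pose, "left_hand", "right_shoulder")
```

Summing one such term per LMM-generated contact phrase yields a differentiable objective for the 3D pose optimization.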
https://arxiv.org/abs/2405.03689
Bimanual manipulation is a longstanding challenge in robotics due to the large number of degrees of freedom and the strict spatial and temporal synchronization required to generate meaningful behavior. Humans learn bimanual manipulation skills by watching other humans and by refining their abilities through play. In this work, we aim to enable robots to learn bimanual manipulation behaviors from human video demonstrations and fine-tune them through interaction. Inspired by seminal work in psychology and biomechanics, we propose modeling the interaction between two hands as a serial kinematic linkage, specifically as a screw motion, which we use to define a new action space for bimanual manipulation: screw actions. We introduce ScrewMimic, a framework that leverages this novel action representation to facilitate learning from human demonstration and self-supervised policy fine-tuning. Our experiments demonstrate that ScrewMimic is able to learn several complex bimanual behaviors from a single human video demonstration, and that it outperforms baselines that interpret demonstrations and fine-tune directly in the original space of motion of both arms. For more information and video results, see this https URL.
https://arxiv.org/abs/2405.03666
Zero-shot learning (ZSL) aims to recognize novel classes by transferring shared semantic knowledge (e.g., attributes) from seen classes to unseen classes. Recently, attention-based methods, which align visual features and attributes via a spatial attention mechanism, have exhibited significant progress. However, these methods explore the visual-semantic relationship only in the spatial dimension, which can lead to classification ambiguity when different attributes share similar attention regions, and the semantic relationship between attributes is rarely discussed. To alleviate the above problems, we propose a Dual Relation Mining Network (DRMN) to enable more effective visual-semantic interactions and to learn the semantic relationship among attributes for knowledge transfer. Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion and conducts spatial attention for visual-to-semantic embedding. Moreover, an attribute-guided channel attention is utilized to decouple entangled semantic features. For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images. Additionally, a global classification branch is introduced as a complement to human-defined semantic attributes, and we then combine the results with attribute-based classification. Extensive experiments demonstrate that the proposed DRMN achieves new state-of-the-art performance on three standard ZSL benchmarks, i.e., CUB, SUN, and AwA2.
https://arxiv.org/abs/2405.03613
AI agents are commonly trained with large datasets of demonstrations of human behavior. However, not all behaviors are equally safe or desirable. Desired characteristics for an AI agent can be expressed by assigning desirability scores, which we assume are not assigned to individual behaviors but to collective trajectories. For example, in a dataset of vehicle interactions, these scores might relate to the number of incidents that occurred. We first assess the effect of each individual agent's behavior on the collective desirability score, e.g., assessing how likely an agent is to cause incidents. This allows us to selectively imitate agents with a positive effect, e.g., only imitating agents that are unlikely to cause incidents. To enable this, we propose the concept of an agent's Exchange Value, which quantifies an individual agent's contribution to the collective desirability score. The Exchange Value is the expected change in desirability score when substituting the agent for a randomly selected agent. We propose additional methods for estimating Exchange Values from real-world datasets, enabling us to learn desired imitation policies that outperform relevant baselines. The project website can be found at this https URL.
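The Exchange Value definition above admits a direct computation: average the change in the collective score over every possible substitution position, which equals the expectation over a uniformly random substitute. The toy desirability function and agent labels below are invented for illustration:

```python
def exchange_value(agent, group, score_fn):
    """Exchange Value of `agent`: expected change in the collective
    desirability score when the agent replaces a uniformly random
    member of the group (computed here exactly by enumeration)."""
    base = score_fn(group)
    deltas = []
    for i in range(len(group)):
        substituted = list(group)
        substituted[i] = agent
        deltas.append(score_fn(substituted) - base)
    return sum(deltas) / len(deltas)


# Toy desirability: fraction of trajectories without an incident.
def score(group):
    return sum(1 for g in group if g == "safe") / len(group)


group = ["safe", "risky", "risky"]
```

A "safe" agent has a positive Exchange Value here and a "risky" one a negative value, so selective imitation would keep only the former.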
https://arxiv.org/abs/2405.03735
Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for acoustic scene classification. However, these methods are not yet lightweight enough, and their performance is not satisfactory. To solve these problems, we propose a deep space separable distillation network. Firstly, the network performs high-low frequency decomposition on the log-mel spectrogram, significantly reducing computational complexity while maintaining model performance. Secondly, we specially design three lightweight operators for ASC, including Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction capabilities in acoustic scene classification tasks. The experimental results demonstrate that the proposed method achieves a performance gain of 9.8% compared to currently popular deep learning methods, while also having a smaller parameter count and lower computational complexity.
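The high-low frequency decomposition amounts to cutting the mel-bin axis of the spectrogram in two so that each branch processes a smaller input. A minimal sketch, with an invented toy spectrogram and an assumed cut point (the paper's actual split is not specified here):

```python
def split_bands(spec, cut):
    """Split a (mel_bins x frames) log-mel spectrogram into a
    low-frequency and a high-frequency sub-band, halving the input
    each downstream branch must process."""
    return spec[:cut], spec[cut:]


# Toy 8-bin x 4-frame spectrogram with invented values.
spec = [[float(b * 10 + t) for t in range(4)] for b in range(8)]
low, high = split_bands(spec, cut=4)
```

Each sub-band would then be fed through its own stack of the lightweight SC/OSC/SPC operators.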
https://arxiv.org/abs/2405.03567
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, exhibiting an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. Finally, we examine the challenges and limitations of world models and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: this https URL.
https://arxiv.org/abs/2405.03520
Learning from Demonstration allows robots to mimic human actions. However, these methods do not model constraints crucial to ensuring the safety of the learned skill. Moreover, even when explicitly modelling constraints, they rely on the assumption of a known cost function, which limits their practical usability for tasks with unknown costs. In this work we propose a two-step optimization process that estimates costs and constraints by decoupling the learning of cost functions from the identification of unknown constraints within the demonstrated trajectories. Initially, we identify the cost function by isolating the effect of constraints on parts of the demonstrations. Subsequently, a constraint learning method is used to identify the unknown constraints. Our approach is validated both on simulated trajectories and on a real robotic manipulation task. Our experiments show the impact that incorrect cost estimation has on the learned constraints and illustrate how the proposed method is able to infer unknown constraints, such as obstacles, from demonstrated trajectories without any initial knowledge of the cost.
https://arxiv.org/abs/2405.03491
Implementing virtual fixtures in guiding tasks constrains the movement of the robot's end effector to specific curves within its workspace. However, guiding frameworks may encounter discontinuities when optimizing the reference target position to the point nearest the current robot position. This article gives a geometric interpretation of such discontinuities, with specific reference to the commonly adopted Gauss-Newton algorithm. The effect of such discontinuities, which we define as Euclidean Distance Singularities, is demonstrated experimentally. We then propose a solution based on a Linear Quadratic Tracking problem with a minimum-jerk command, and we compare and validate the performance of the proposed framework in two different human-robot interaction scenarios.
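For context on the minimum-jerk command, the standard closed-form minimum-jerk profile between two positions is the fifth-order polynomial below. This is the textbook formula, not the article's LQT derivation, and the function name is our own:

```python
def min_jerk(x0, xf, T, t):
    """Closed-form minimum-jerk position profile from x0 to xf over
    duration T, with zero velocity and acceleration at both ends:
    x(t) = x0 + (xf - x0) * (10*tau^3 - 15*tau^4 + 6*tau^5)."""
    tau = min(max(t / T, 0.0), 1.0)  # normalized, clamped time
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return x0 + (xf - x0) * s
```

An LQT formulation would penalize deviation from such a smooth reference while tracking the virtual fixture, avoiding the jumps caused by the nearest-point discontinuities.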
https://arxiv.org/abs/2405.03473
This paper explores how deep learning techniques can improve visual-based SLAM performance in challenging environments. By combining deep feature extraction and deep matching methods, we introduce a versatile hybrid visual SLAM system designed to enhance adaptability in challenging scenarios such as low-light conditions, dynamic lighting, weak-texture areas, and severe jitter. Our system supports multiple modes, including monocular, stereo, monocular-inertial, and stereo-inertial configurations. We also analyze how to combine visual SLAM with deep learning methods to inform other research. Through extensive experiments on both public datasets and self-sampled data, we demonstrate the superiority of the SL-SLAM system over traditional approaches. The experimental results show that SL-SLAM outperforms state-of-the-art SLAM algorithms in terms of localization accuracy and tracking robustness. For the benefit of the community, we make the source code public at this https URL.
https://arxiv.org/abs/2405.03413
Failure mode and effects analysis (FMEA) is a systematic approach to identify and analyse potential failures and their effects in a system or process. The FMEA approach, however, requires domain experts to manually analyse the FMEA model to derive risk-reducing actions that should be applied. In this paper, we provide a formal framework to allow for automatic planning and acting in FMEA models. More specifically, we cast the FMEA model into a Markov decision process which can then be solved by existing solvers. We show that the FMEA approach can not only be used to support medical experts during the modelling process but also to automatically derive optimal therapies for the treatment of patients.
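An MDP cast from an FMEA model can be handed to a standard solver such as value iteration. The solver below is the generic textbook algorithm; the two-state "fault/repair" example is an invented toy, not a model from the paper:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Generic value iteration. P[(s, a)] lists (next_state, prob)
    pairs and R[(s, a)] is the immediate reward; an FMEA model cast
    as an MDP (states = failure conditions, actions = risk-reducing
    interventions) fits this interface."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Invented two-state toy: repairing a fault costs 1 but restores health;
# waiting while faulty costs 2 per step.
states = ["faulty", "healthy"]
actions = ["repair", "wait"]
P = {
    ("faulty", "repair"): [("healthy", 1.0)],
    ("faulty", "wait"): [("faulty", 1.0)],
    ("healthy", "repair"): [("healthy", 1.0)],
    ("healthy", "wait"): [("healthy", 1.0)],
}
R = {
    ("faulty", "repair"): -1.0,
    ("faulty", "wait"): -2.0,
    ("healthy", "repair"): -1.0,
    ("healthy", "wait"): 0.0,
}
V = value_iteration(states, actions, P, R)
```

The optimal values show that repairing immediately dominates waiting, which is the kind of risk-reducing action the framework derives automatically.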
https://arxiv.org/abs/2405.03406
Given the rapid advancement of large-scale language models, artificial intelligence (AI) models, like ChatGPT, are playing an increasingly prominent role in human society. However, to ensure that artificial intelligence models benefit human society, we must first fully understand the similarities and differences between the human-like characteristics exhibited by artificial intelligence models and real humans, as well as the cultural stereotypes and biases that artificial intelligence models may exhibit in the process of interacting with humans. This study first measured ChatGPT in 84 dimensions of psychological characteristics, revealing differences between ChatGPT and human norms in most dimensions as well as in high-dimensional psychological representations. Additionally, through the measurement of ChatGPT in 13 dimensions of cultural values, it was revealed that ChatGPT's cultural value patterns are dissimilar to those of various countries/regions worldwide. Finally, an analysis of ChatGPT's performance in eight decision-making tasks involving interactions with humans from different countries/regions revealed that ChatGPT exhibits clear cultural stereotypes in most decision-making tasks and shows significant cultural bias in third-party punishment and ultimatum games. The findings indicate that, compared to humans, ChatGPT exhibits a distinct psychological profile and cultural value orientation, and it also shows cultural biases and stereotypes in interpersonal decision-making. Future research should emphasize enhanced technical oversight and greater transparency in training databases and algorithmic procedures to foster more efficient cross-cultural communication and mitigate social disparities.
https://arxiv.org/abs/2405.03387
Reinforcement learning (RL) presents a promising framework to learn policies through environment interaction, but often requires an infeasible amount of interaction data to solve complex tasks from sparse rewards. One direction is to augment RL with offline data demonstrating desired tasks, but past work often requires large amounts of high-quality demonstration data that are difficult to obtain, especially for domains such as robotics. Our approach consists of a reverse curriculum followed by a forward curriculum. Unique to our approach compared to past work is the ability to efficiently leverage more than one demonstration via a per-demonstration reverse curriculum generated via state resets. The result of our reverse curriculum is an initial policy that performs well on a narrow initial state distribution and helps overcome difficult exploration problems. A forward curriculum is then used to accelerate the training of the initial policy to perform well on the full initial state distribution of the task and to improve demonstration and sample efficiency. We show how the combination of a reverse curriculum and a forward curriculum in our method, RFCL, enables significant improvements in demonstration and sample efficiency compared against various state-of-the-art learning-from-demonstration baselines, even solving previously unsolvable tasks that require high precision and control.
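A per-demonstration reverse curriculum via state resets might look schematically like the sketch below: episodes reset to a demonstration state near the goal, and the reset point moves one step earlier once the policy is reliable from the current one. The function name, the single success-rate threshold, and the toy demo are invented for illustration and are not RFCL's exact schedule:

```python
def next_reset(demo, stage, success_rate, threshold=0.8):
    """Reverse curriculum via state resets: start episodes near the
    end of a demonstration (close to the goal) and move the reset
    point one step earlier once success from the current point is
    reliable enough."""
    if success_rate >= threshold and stage < len(demo) - 1:
        stage += 1  # graduate to a harder (earlier) reset point
    return demo[len(demo) - 1 - stage], stage


demo = [f"s{i}" for i in range(10)]  # s9 is the goal state
state, stage = next_reset(demo, stage=1, success_rate=0.9)
```

Running one such schedule per demonstration is what lets the method exploit several demonstrations at once before the forward curriculum widens the initial state distribution.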
https://arxiv.org/abs/2405.03379