Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence across various domains. This has inspired researchers to train end-to-end MLLMs or to use large models to generate policies with human-selected prompts for embodied agents. However, these methods exhibit limited generalization to unseen tasks or scenarios and overlook the multimodal environment information that is critical for robots to make decisions. In this paper, we introduce a novel Robotic Multimodal Perception-Planning (RoboMP$^2$) framework for robotic manipulation, which consists of a Goal-Conditioned Multimodal Perceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). Specifically, GCMP captures environment states by employing a tailored MLLM for embodied agents with semantic reasoning and localization abilities. RAMP uses a coarse-to-fine retrieval method to find the $k$ most relevant policies as in-context demonstrations to enhance the planner. Extensive experiments demonstrate the superiority of RoboMP$^2$ on both the VIMA benchmark and real-world tasks, with around a 10% improvement over the baselines.
https://arxiv.org/abs/2404.04929
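To make the retrieval step above concrete, here is a minimal Python sketch of a coarse-to-fine policy retriever: a cheap embedding-similarity pass shortlists candidates, and a finer re-ranking pass selects the $k$ in-context demonstrations. The `embed` function and the library format are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of coarse-to-fine retrieval of in-context policy demonstrations.
# `embed` is a random stand-in for a real sentence-embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def coarse_to_fine_retrieve(instruction, library, k=3, coarse_n=20):
    # Coarse stage: cheap embedding similarity over the whole policy library.
    q = embed(instruction)
    shortlist = sorted(library, key=lambda d: -float(q @ embed(d["task"])))[:coarse_n]
    # Fine stage: re-rank the shortlist with a finer criterion
    # (token overlap here, standing in for a cross-encoder or MLLM scorer).
    def overlap(d):
        a, b = set(instruction.lower().split()), set(d["task"].lower().split())
        return len(a & b) / max(len(a | b), 1)
    return sorted(shortlist, key=overlap, reverse=True)[:k]

library = [
    {"task": "put the red block into the bowl", "policy": "pick(red_block); place(bowl)"},
    {"task": "stack the green block on the blue block", "policy": "pick(green); place(blue)"},
    {"task": "sweep the beans into the dustpan", "policy": "push(beans, dustpan)"},
]
demos = coarse_to_fine_retrieve("put the blue block into the bowl", library, k=2)
# The retrieved demonstrations are prepended to the planner prompt.
print("\n".join(f"Task: {d['task']}\nPolicy: {d['policy']}" for d in demos))
```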
With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Multi-modal Language Models (MLMs) now integrate multi-modal signals into LLMs, bringing richer perception to embodied agents and allowing them to perceive world-understanding tasks more delicately. However, existing works 1) use agents that operate independently from perception to action, each containing multiple LLMs, resulting in gaps between complex tasks and execution; 2) train MLMs on static data, struggling with the dynamics of open-ended scenarios; and 3) feed prior knowledge directly as prompts, suppressing application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 on open-ended tasks, with $1.4\times$-$7.3\times$ gains in performance.
https://arxiv.org/abs/2404.04619
We present an embodied AI system that receives open-ended natural language instructions from a human and controls two arms to collaboratively accomplish potentially long-horizon tasks over a large workspace. Our system is modular: it deploys state-of-the-art Large Language Models for task planning, Vision-Language models for semantic perception, and Point Cloud transformers for grasping. With semantic and physical safety in mind, these modules are interfaced with a real-time trajectory optimizer and a compliant tracking controller to enable human-robot proximity. We demonstrate performance on bi-arm sorting, bottle opening, and trash disposal tasks. These are done zero-shot: the models used have not been trained with any real-world data from this bi-arm robot, its scenes, or its workspace. Composing both learning- and non-learning-based components in a modular fashion with interpretable inputs and outputs allows the user to easily debug points of failure and fragility. One may also swap modules in place to improve the robustness of the overall platform, for instance with imitation-learned policies.
https://arxiv.org/abs/2404.03570
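As a rough illustration of the modular, interpretable-interface design described above, the sketch below wires placeholder planning, perception, and grasping modules through typed inputs and outputs. The class names and behaviours are hypothetical stand-ins; each module could be swapped for a learned or non-learned component without changing the interfaces.

```python
# Minimal sketch of a modular plan/perceive/grasp pipeline with interpretable I/O.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str
    position: tuple  # (x, y, z) in the robot frame

class Planner:      # stands in for an LLM-based task planner
    def plan(self, instruction: str, scene: List[Detection]) -> List[str]:
        labels = [d.label for d in scene]
        return [f"pick({l})" for l in labels if l in instruction] + ["place(bin)"]

class Perception:   # stands in for a vision-language detector
    def detect(self, image) -> List[Detection]:
        return [Detection("bottle", (0.4, 0.1, 0.02))]

class Grasper:      # stands in for a point-cloud grasp predictor
    def grasp_pose(self, det: Detection):
        x, y, z = det.position
        return (x, y, z + 0.05)  # approach 5 cm above the object

def run(instruction, image):
    scene = Perception().detect(image)
    steps = Planner().plan(instruction, scene)
    poses = [Grasper().grasp_pose(d) for d in scene]
    return steps, poses  # handed to the trajectory optimizer / compliant controller

print(run("throw the bottle in the bin", image=None))
```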
Robotic technologies have been an indispensable part of improving human productivity, helping humans complete diverse, complex, and intensive tasks in a fast yet accurate and efficient way. Therefore, robotic technologies have been deployed in a wide range of applications, ranging from personal to industrial use-cases. However, current robotic technologies and their computing paradigm still lack embodied intelligence to efficiently interact with operational environments, respond with correct/expected actions, and adapt to changes in the environment. Toward this, recent advances in neuromorphic computing with Spiking Neural Networks (SNNs) have demonstrated the potential to enable embodied intelligence for robotics through a bio-plausible computing paradigm that mimics how the biological brain works, known as "neuromorphic artificial intelligence (AI)". However, the field of neuromorphic AI-based robotics is still at an early stage, so its development and deployment for solving real-world problems expose new challenges in different design aspects, such as accuracy, adaptability, efficiency, reliability, and security. To address these challenges, this paper discusses how we can enable embodied neuromorphic AI for robotic systems through our perspectives: (P1) Embodied intelligence based on effective learning rules, training mechanisms, and adaptability; (P2) Cross-layer optimizations for energy-efficient neuromorphic computing; (P3) Representative and fair benchmarks; (P4) Low-cost reliability and safety enhancements; (P5) Security and privacy for neuromorphic computing; and (P6) Synergistic development of energy-efficient and robust neuromorphic-based robotics. Furthermore, this paper identifies research challenges and opportunities and elaborates our vision for future research toward embodied neuromorphic AI for robotics.
https://arxiv.org/abs/2404.03325
This paper explores the integration of linguistic inputs within robotic navigation systems, drawing upon the symbol interdependency hypothesis to bridge the divide between symbolic and embodied cognition. It examines previous work incorporating language and semantics into Neural Network (NN) and Simultaneous Localization and Mapping (SLAM) approaches, highlighting how these integrations have advanced the field. By contrasting abstract symbol manipulation with sensory-motor grounding, we propose a unified framework where language functions both as an abstract communicative system and as a grounded representation of perceptual experiences. Our review of cognitive models of distributional semantics and their application to autonomous agents underscores the transformative potential of language-integrated systems.
https://arxiv.org/abs/2404.03049
Recent advancements in language models have demonstrated their adeptness at conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly human motion models. By integrating multi-turn conversations into the control of continuous virtual human movements, generative human motion models can achieve an intuitive, step-by-step process of human task execution for humanoid robotics, game agents, or other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous, long-term human motion through multimodal prompts. Specifically, MotionChain consists of multi-modal tokenizers that transform various data types, such as text, image, and motion, into discrete tokens, coupled with a Vision-Motion-aware Language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive ways of controlling and interacting with virtual humans.
https://arxiv.org/abs/2404.01700
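For intuition, the sketch below shows one common way such a motion tokenizer can work: continuous pose frames are snapped to the nearest entry of a codebook so they become discrete tokens a language model can consume alongside text. The codebook here is random and purely illustrative; MotionChain's actual tokenizers may differ.

```python
# Minimal sketch of vector-quantization-style motion tokenization.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 6))   # 256 codes, 6-D pose features (illustrative)

def motion_to_tokens(frames):
    # frames: (T, 6) continuous motion features -> (T,) discrete token ids
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def tokens_to_motion(tokens):
    # decode token ids back to (approximate) pose features
    return codebook[tokens]

frames = rng.standard_normal((4, 6))
ids = motion_to_tokens(frames)
print(ids, tokens_to_motion(ids).shape)    # e.g. token ids and (4, 6) decoded poses
```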
The vulnerability of deep neural networks to adversarial patches has motivated numerous defense strategies for boosting model robustness. However, prevailing defenses depend on a single observation or pre-established adversary information to counter adversarial patches, often failing against unseen or adaptive adversarial attacks and exhibiting unsatisfying performance in dynamic 3D environments. Inspired by active human perception and recurrent feedback mechanisms, we develop Embodied Active Defense (EAD), a proactive defensive strategy that actively contextualizes environmental information to address misaligned adversarial patches in 3D real-world settings. To achieve this, EAD develops two central recurrent sub-modules, i.e., a perception module and a policy module, to implement two critical functions of active vision. These models recurrently process a series of beliefs and observations, facilitating progressive refinement of their comprehension of the target object and enabling strategic actions to counter adversarial patches in 3D environments. To optimize learning efficiency, we incorporate a differentiable approximation of environmental dynamics and deploy patches that are agnostic to the adversary's strategies. Extensive experiments demonstrate that EAD substantially enhances robustness against a variety of patches within just a few steps through its action policy in safety-critical tasks (e.g., face recognition and object detection), without compromising standard accuracy. Furthermore, owing to its attack-agnostic characteristic, EAD generalizes well to unseen attacks, diminishing the average attack success rate by 95 percent across a range of unseen adversarial attacks.
https://arxiv.org/abs/2404.00540
We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained only on language, we show that these Transformers excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art imitation learning (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains to learn general patterns in demonstration data for highly efficient imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks. Videos are available at this https URL.
https://arxiv.org/abs/2403.19578
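The core idea above, rendering keypoints and actions as text so an off-the-shelf language model can do in-context imitation, can be sketched as follows. The token formats and coordinate conventions are assumptions for illustration, not the exact KAT scheme.

```python
# Minimal sketch: serialize keypoints/actions as text and build a few-shot prompt.
def tokenize_keypoints(keypoints):
    # keypoints: list of (x, y, z) detected in the current observation
    return " ".join(f"<kp {x:.2f} {y:.2f} {z:.2f}>" for x, y, z in keypoints)

def tokenize_actions(waypoints):
    # waypoints: list of (x, y, z) end-effector targets
    return " ".join(f"<act {x:.2f} {y:.2f} {z:.2f}>" for x, y, z in waypoints)

def build_prompt(demos, current_keypoints):
    lines = ["Map observation tokens to action tokens."]
    for kps, acts in demos:
        lines.append(f"Observation: {tokenize_keypoints(kps)}")
        lines.append(f"Actions: {tokenize_actions(acts)}")
    lines.append(f"Observation: {tokenize_keypoints(current_keypoints)}")
    lines.append("Actions:")
    return "\n".join(lines)

demos = [([(0.10, 0.20, 0.05)], [(0.10, 0.20, 0.15), (0.10, 0.20, 0.02)])]
prompt = build_prompt(demos, [(0.12, 0.18, 0.05)])
# `prompt` is then sent to a text-pretrained model (e.g. GPT-4 Turbo) and the
# returned <act ...> tokens are parsed back into an end-effector trajectory.
print(prompt)
```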
Optimizing the morphologies and controllers that adapt to various tasks is a critical issue in the field of robot design, also known as embodied intelligence. Previous works typically model it as a joint optimization problem and use search-based methods to find the optimal solution in the morphology space. However, they ignore the implicit knowledge of task-to-morphology mapping, which can directly inspire robot design; for example, flipping heavier boxes tends to require more muscular robot arms. This paper proposes Task2Morph, a novel and general differentiable, task-inspired framework for contact-aware robot design. We abstract task features highly related to task performance and use them to build a task-to-morphology mapping. Further, we embed the mapping into a differentiable robot design process, where gradient information is leveraged for both the mapping learning and the whole optimization. Experiments are conducted on three scenarios, and the results validate that Task2Morph outperforms DiffHand, which lacks a task-inspired morphology module, in terms of efficiency and effectiveness.
https://arxiv.org/abs/2403.19093
Grounding the common-sense reasoning of Large Language Models in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search of task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: this https URL
https://arxiv.org/abs/2403.17124
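A minimal sketch of the perturbation-replay step described above: a demonstration is replayed with small synthetic offsets and each replay is labelled success or failure, producing the positive executions and counterfactuals that the end-to-end network and grounding classifiers are trained on. The trajectory format and success check are placeholders for a real simulator.

```python
# Minimal sketch of generating success/failure coverage around a demonstration.
import numpy as np

def replay_with_perturbation(demo, sigma, rng):
    noise = rng.normal(0.0, sigma, size=np.asarray(demo).shape)
    return np.asarray(demo) + noise            # perturbed state trajectory

def task_success(trajectory):
    # Placeholder success check; a simulator or human labeler would go here.
    return float(np.linalg.norm(trajectory[-1] - np.array([0.5, 0.5])) < 0.1)

rng = np.random.default_rng(0)
demo = np.array([[0.0, 0.0], [0.25, 0.25], [0.5, 0.5]])   # toy 2D demonstration
data = []
for _ in range(100):
    traj = replay_with_perturbation(demo, sigma=0.08, rng=rng)
    data.append((traj, task_success(traj)))    # (states, success/failure label)

labels = np.array([y for _, y in data])
print(f"{labels.mean():.0%} of perturbed replays still succeed")
# These (state, label) pairs are the training signal for the classifiers that
# ground low-level states in mode families, without dense manual labeling.
```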
We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they do not have an internal memory for mapping the scene so as to plan how to explore over time, and their confidence can be miscalibrated, causing the robot to prematurely stop exploration or over-explore. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM, leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question-answering confidence, allowing the robot to know when to stop exploration, leading to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show that our proposed approach improves performance and efficiency over baselines that do not leverage VLMs for exploration or do not calibrate their confidence. Webpage with experiment videos and code: this https URL
https://arxiv.org/abs/2403.15941
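As a concrete illustration of the calibration step, the sketch below applies split conformal prediction to a held-out set of confidences assigned to the true answer, deriving a threshold that tells the agent when its current answer is trustworthy enough to stop exploring. The synthetic data, the nonconformity score, and the simple stopping rule are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of split conformal calibration of answer confidence.
import numpy as np

def conformal_threshold(true_answer_confidences, alpha=0.1):
    scores = 1.0 - np.asarray(true_answer_confidences)   # nonconformity of the true answer
    n = len(scores)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)     # finite-sample corrected level
    return float(np.quantile(scores, q))

def confident_enough(vlm_confidence, threshold):
    # Stop exploring once the current answer's nonconformity falls within the
    # calibrated threshold, i.e. its confidence is high enough.
    return (1.0 - vlm_confidence) <= threshold

rng = np.random.default_rng(0)
cal = rng.beta(8, 2, size=500)            # synthetic calibration confidences
tau = conformal_threshold(cal, alpha=0.1)
print(round(tau, 3), confident_enough(0.95, tau), confident_enough(0.55, tau))
```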
Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Previous approaches have adopted Neural Radiance Fields (NeRFs) to analyze 3D scenes. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is distilling pre-trained 2D semantics into 3D Gaussians. We design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians, without the additional training required by NeRFs. We further build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference. We explore several applications of Semantic Gaussians: semantic segmentation on ScanNet-20, where our approach attains a 4.2% mIoU and 4.0% mAcc improvement over prior open-vocabulary scene understanding counterparts; object part segmentation, scene editing, and spatial-temporal segmentation with better qualitative results over 2D and 3D baselines, highlighting its versatility and effectiveness on supporting diverse downstream tasks.
https://arxiv.org/abs/2403.15624
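The projection idea can be illustrated as below: each 3D Gaussian center is projected into a calibrated view with a pinhole camera model and picks up the pre-trained 2D semantic feature at that pixel; in practice, features from multiple views would be fused (e.g. averaged). Shapes, the camera model, and the sampling scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: attach 2D semantic features to 3D Gaussian centers by projection.
import numpy as np

def project_features(centers, feat_map, K, R, t):
    """centers: (N,3) world points; feat_map: (H,W,C) 2D features; K,R,t: camera."""
    H, W, C = feat_map.shape
    cam = (R @ centers.T + t.reshape(3, 1)).T           # world -> camera frame
    uv = (K @ cam.T).T                                   # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    feats = np.zeros((len(centers), C))
    for i, (u, v) in enumerate(uv):
        ui, vi = int(round(u)), int(round(v))
        if 0 <= vi < H and 0 <= ui < W and cam[i, 2] > 0:
            feats[i] = feat_map[vi, ui]                  # semantic feature for Gaussian i
    return feats

K = np.array([[500., 0., 160.], [0., 500., 120.], [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
centers = np.array([[0.0, 0.0, 2.0], [0.3, 0.1, 2.5]])
feat_map = np.random.default_rng(0).standard_normal((240, 320, 8))
print(project_features(centers, feat_map, K, R, t).shape)   # (2, 8)
```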
Navigating toward specific objects in unknown environments without additional training, known as Zero-Shot object navigation, poses a significant challenge in the field of robotics, as it demands high levels of auxiliary information and strategic planning. Traditional works have focused on holistic solutions, overlooking the specific challenges agents encounter during navigation, such as collision, low exploration efficiency, and misidentification of targets. To address these challenges, we propose TriHelper, a novel framework designed to assist agents dynamically through three primary navigation challenges: collision, exploration, and detection. Specifically, our framework consists of three innovative components: (i) Collision Helper, (ii) Exploration Helper, and (iii) Detection Helper. These components work collaboratively to solve these challenges throughout the navigation process. Experiments on the Habitat-Matterport 3D (HM3D) and Gibson datasets demonstrate that TriHelper significantly outperforms all existing baseline methods in Zero-Shot object navigation, showcasing superior success rates and exploration efficiency. Our ablation studies further underscore the effectiveness of each helper in addressing its respective challenge, notably enhancing the agent's navigation capabilities. By proposing TriHelper, we offer a fresh perspective on advancing the object navigation task, paving the way for future research in the domain of Embodied AI and visual-based navigation.
https://arxiv.org/abs/2403.15223
Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues for human interaction with embodied agents or robots using natural language. Despite this progress, we find a notable limitation: existing 3D-VL models are sensitive to the style of language input, struggling to understand sentences with the same semantic meaning but written in different variants. This observation raises a critical question: Can 3D vision-language models truly understand natural language? To test the language understandability of 3D-VL models, we first propose a language robustness task for systematically assessing 3D-VL models across various tasks, benchmarking their performance when presented with different language style variants. Importantly, these variants are commonly encountered in applications requiring direct interaction with humans, such as embodied robotics, given the diversity and unpredictability of human language. We propose a 3D Language Robustness Dataset, designed based on the characteristics of human language, to facilitate the systematic study of robustness. Our comprehensive evaluation uncovers a significant drop in the performance of all existing models across various 3D-VL tasks. Even the state-of-the-art 3D-LLM fails to understand some variants of the same sentences. Further in-depth analysis suggests that the existing models have a fragile and biased fusion module, which stems from the low diversity of the existing datasets. Finally, we propose a training-free module driven by LLMs, which improves language robustness. Datasets and code will be available on GitHub.
https://arxiv.org/abs/2403.14760
Object-goal navigation is a crucial engineering task for the embodied navigation community; it involves navigating to an instance of a specified object category within unseen environments. Although extensive investigations have been conducted on both end-to-end and modular data-driven approaches, fully enabling an agent to comprehend the environment through perceptual knowledge and perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, large language models have shown potential in this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular approach, trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from a large language model. We utilize the multi-channel Swin-Unet architecture to conduct multi-task learning with multimodal inputs. Results in the Habitat simulator demonstrate that our framework outperforms the baseline by an average of 10.6% in the efficiency metric, Success weighted by Path Length (SPL). A real-world demonstration shows that the proposed approach can efficiently conduct this task by traversing several rooms. For more details and real-world demonstrations, please check our project webpage (this https URL).
https://arxiv.org/abs/2403.14163
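For reference, the efficiency metric quoted above, Success weighted by Path Length (SPL), is conventionally defined over $N$ episodes as

$$\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i\,\frac{l_i}{\max(p_i,\, l_i)},$$

where $S_i$ indicates success on episode $i$, $l_i$ is the shortest-path distance from start to goal, and $p_i$ is the path length the agent actually traversed; an episode therefore only scores highly when the agent both succeeds and takes a near-optimal path.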
We demonstrate experimental results with LLMs that address robotics action planning problems. Recently, LLMs have been applied to robotics action planning, particularly via a code generation approach that converts complex high-level instructions into mid-level policy code. In contrast, our approach acquires text descriptions of the task and scene objects, formulates action planning through natural language reasoning, and outputs coordinate-level control commands, thus reducing the need for intermediate representation code as policies. Our approach is evaluated on a multi-modal prompt simulation benchmark, demonstrating that our prompt engineering with natural language reasoning significantly enhances success rates compared to prompts without it. Furthermore, our approach illustrates the potential for natural language descriptions to transfer robotics skills from known tasks to previously unseen tasks.
https://arxiv.org/abs/2403.13801
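A minimal sketch of the prompting pattern described above: textual task and scene descriptions go in, natural-language reasoning plus coordinate-level commands come out. The prompt wording and the MOVE/PICK/PLACE command format are illustrative assumptions, not the paper's exact interface.

```python
# Minimal sketch of a prompt that asks an LLM to reason in natural language and
# then emit coordinate-level commands, which are parsed line by line afterwards.
scene = {"red block": (0.42, -0.10, 0.02), "bowl": (0.55, 0.20, 0.00)}
task = "Put the red block into the bowl."

prompt = (
    "You control a robot arm. Think step by step in natural language, then "
    "output one MOVE/PICK/PLACE command per line with xyz coordinates in meters.\n"
    f"Objects: {scene}\nTask: {task}\nReasoning and commands:"
)
# A plausible completion (hypothetical) would be parsed into control commands:
#   The red block is at (0.42, -0.10, 0.02); the bowl is at (0.55, 0.20, 0.00).
#   PICK 0.42 -0.10 0.02
#   PLACE 0.55 0.20 0.05
print(prompt)
```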
We study the task of 3D multi-object re-identification from embodied tours. Specifically, an agent is given two tours of an environment (e.g. an apartment) under two different layouts (e.g. arrangements of furniture). Its task is to detect and re-identify objects in 3D - e.g. a "sofa" moved from location A to B, a new "chair" in the second layout at location C, or a "lamp" from location D in the first layout missing in the second. To support this task, we create an automated infrastructure to generate paired egocentric tours of initial/modified layouts in the Habitat simulator using Matterport3D scenes, YCB and Google-scanned objects. We present 3D Semantic MapNet (3D-SMNet) - a two-stage re-identification model consisting of (1) a 3D object detector that operates on RGB-D videos with known pose, and (2) a differentiable object matching module that solves correspondence estimation between two sets of 3D bounding boxes. Overall, 3D-SMNet builds object-based maps of each layout and then uses a differentiable matcher to re-identify objects across the tours. After training 3D-SMNet on our generated episodes, we demonstrate zero-shot transfer to real-world rearrangement scenarios by instantiating our task in Replica, Active Vision, and RIO environments depicting rearrangements. On all datasets, we find 3D-SMNet outperforms competitive baselines. Further, we show jointly training on real and generated episodes can lead to significant improvements over training on real data alone.
https://arxiv.org/abs/2403.13190
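The differentiable matching step can be sketched with a Sinkhorn-style soft assignment over a pairwise cost between the two sets of 3D boxes. The NumPy example below uses center distance only and is a stand-in for the paper's matching module, which operates on learned object features; real layouts with added or missing objects would additionally need an "unmatched" slot in the assignment.

```python
# Minimal sketch of differentiable set matching between two layouts of 3D boxes.
import numpy as np

def sinkhorn_match(centers_a, centers_b, n_iters=50, temperature=0.1):
    # Cost: Euclidean distance between box centers (learned features could be added).
    cost = np.linalg.norm(centers_a[:, None, :] - centers_b[None, :, :], axis=-1)
    P = np.exp(-cost / temperature)
    for _ in range(n_iters):                   # alternate row/column normalization
        P = P / P.sum(axis=1, keepdims=True)
        P = P / P.sum(axis=0, keepdims=True)
    return P                                   # soft correspondence matrix

layout1 = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0]])   # e.g. sofa, chair
layout2 = np.array([[2.1, 1.0, 0.0], [0.2, 0.1, 0.0]])   # same objects, rearranged
P = sinkhorn_match(layout1, layout2)
print(P.round(2))           # rows ~ layout1 objects, columns ~ layout2 objects
print(P.argmax(axis=1))     # hard re-identification read-out: [1 0]
```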
Soldiers in the field often need to cross negative obstacles, such as rivers or canyons, to reach goals or safety. Military gap crossing involves constructing temporary bridges on site. However, this procedure is dangerous, time- and labor-intensive, and requires specialized machinery. We envision a scalable robotic solution inspired by advancements in force-controlled and Cable-Driven Parallel Robots (CDPRs); this solution can address the challenges inherent in this transportation problem, achieving fast, efficient, and safe deployment and field operations. We introduce the embodied vision in Co3MaNDR, a solution to the military gap crossing problem: a distributed robot consisting of several modules that simultaneously pull on a central payload, controlling the cables' tensions to achieve complex objectives such as precise trajectory tracking or force amplification. Hardware experiments demonstrate teleoperation of a payload, trajectory following, and the sensing and amplification of operators' applied physical forces during slow operations. An operator was shown to manipulate a 27.2 kg (60 lb) payload with an average force utilization of 14.5% of its weight. Results indicate that the system can be scaled up to heavier payloads without compromising performance or introducing superfluous complexity. This research lays a foundation for expanding CDPR technology to uncoordinated and unstable mobile platforms in unknown environments.
https://arxiv.org/abs/2403.13124
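For scale, the reported 14.5% force utilization on the 27.2 kg (60 lb) payload corresponds to an average operator force of roughly

$$F_{\text{op}} \approx 0.145 \times 27.2\,\mathrm{kg} \times 9.81\,\mathrm{m/s^2} \approx 38.7\,\mathrm{N},$$

i.e. about 3.9 kgf, with the cable-driven modules supplying the remaining support.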
What applications is AI ready for? Advances in deep learning and generative approaches have produced AIs that learn from massive online data and outperform manually built AIs. Some AIs outperform people. It is easy (but misleading) to conclude that today's AI technologies can learn to do everything. Conversely, it is striking that big data, deep learning, and generative AI have had so little impact on robotics. For example, today's autonomous robots do not learn to provide home care or to be nursing assistants. Instead, current projects rely on mathematical models, planning frameworks, and reinforcement learning. These methods have not led to the leaps in performance and generality seen with deep learning. Today's AIs do not learn to do such applications because they do not collect, use, and effectively generalize the necessary experiential data by interacting with the world, including people. Aspirationally, robotic AIs would learn experientially, learn from people, serve people broadly, and collaborate with them. Getting to such a future requires understanding the opportunity and creating a path to get there. A path forward would combine multimodal sensing and motor control technology from robotics with deep learning technology adapted for embodied systems. Analogous to foundation classes in deep learning, it would create experiential foundation classes. Success would greatly increase the broad utility of AI robots and grow the market for them. This would lead to lower costs and democratize AI.
https://arxiv.org/abs/2404.04267
Large Language Models (LLMs) have emerged as integral tools for reasoning, planning, and decision-making, drawing upon their extensive world knowledge and proficiency in language-related tasks. LLMs thus hold tremendous potential for natural language interaction within multi-agent systems to foster cooperation. However, LLM agents tend to over-report and comply with any instruction, which may result in information redundancy and confusion in multi-agent cooperation. Inspired by human organizations, this paper introduces a framework that imposes prompt-based organization structures on LLM agents to mitigate these problems. Through a series of experiments with embodied LLM agents and human-agent collaboration, our results highlight the impact of designated leadership on team efficiency, shedding light on the leadership qualities displayed by LLM agents and their spontaneous cooperative behaviors. Further, we harness the potential of LLMs to propose enhanced organizational prompts, via a Criticize-Reflect process, resulting in novel organization structures that reduce communication costs and enhance team efficiency.
https://arxiv.org/abs/2403.12482
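A minimal sketch of what a Criticize-Reflect loop over organization prompts could look like; the `llm` stub stands in for any chat-completion call, and the prompt wording is illustrative, not the paper's.

```python
# Minimal sketch of a Criticize-Reflect loop that rewrites an organization prompt.
def llm(prompt: str) -> str:
    # Replace with a real chat-completion call; this stub keeps the sketch runnable.
    return "[model output for: " + prompt.splitlines()[0] + "]"

def criticize_reflect(org_prompt: str, episode_log: str, rounds: int = 2) -> str:
    for _ in range(rounds):
        critique = llm(
            "Criticize this team's communication (redundancy, confusion, leadership).\n"
            f"Organization prompt:\n{org_prompt}\nEpisode transcript:\n{episode_log}"
        )
        org_prompt = llm(
            "Rewrite the organization prompt to address the critique; keep it short.\n"
            f"Critique:\n{critique}\nCurrent prompt:\n{org_prompt}"
        )
    return org_prompt

new_prompt = criticize_reflect(
    "Agent A leads; others report only task-relevant facts.",
    "A: status? B: I see a kitchen... C: me too... (long transcript)",
)
print(new_prompt)
```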