Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in realistic, evolving environments that span multiple websites. To answer this question, we present MMInA, a multihop and multimodal benchmark for evaluating embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from, or actions on, multiple websites to solve, assessing long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We find that agents are more likely to fail on the early hops when solving tasks with more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories for reflection. Our method significantly improves both the single-hop and multihop web browsing abilities of agents. Code and data are available at https://www.mmina.org/
https://arxiv.org/abs/2404.09992
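As a concrete illustration of MMInA's memory-replay idea, the sketch below keeps completed hop trajectories and prepends them to the prompt for the next hop so the agent can reflect on earlier actions. The class, prompt format, and `call_llm` function are illustrative assumptions, not MMInA's actual implementation.

```python
# Hypothetical sketch of trajectory-replay memory for a multihop web agent.
from dataclasses import dataclass, field

@dataclass
class TrajectoryMemory:
    hops: list = field(default_factory=list)   # one action trace per finished hop

    def add_hop(self, actions: list[str]) -> None:
        self.hops.append(actions)

    def as_context(self) -> str:
        lines = []
        for i, actions in enumerate(self.hops, 1):
            lines.append(f"Hop {i} actions: " + " -> ".join(actions))
        return "\n".join(lines)

def next_action(task: str, observation: str, memory: TrajectoryMemory, call_llm) -> str:
    # Replay past trajectories as extra context before deciding the next action.
    prompt = (
        f"Task: {task}\n"
        f"Past trajectory (for reflection):\n{memory.as_context()}\n"
        f"Current observation:\n{observation}\n"
        "Next action:"
    )
    return call_llm(prompt)
```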
Embodied visual tracking is the task of following a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFMs) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online agent-environment interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust tracker within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. Such efficiency is unprecedented for RL-based visual tracking methods. We evaluate our tracker in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned tracker from the virtual world to real-world scenarios.
https://arxiv.org/abs/2404.09857
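For readers unfamiliar with Conservative Q-Learning, the snippet below is a minimal sketch of a CQL update for discrete actions, of the kind one might use to learn a tracking policy from logged demonstrations. Shapes, hyperparameters, and the network interfaces are illustrative assumptions, not the paper's configuration.

```python
# Minimal CQL(H)-style loss for discrete actions on an offline batch.
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    s, a, r, s_next, done = batch           # tensors drawn from an offline buffer
    q_all = q_net(s)                         # (B, num_actions)
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_taken, target)
    # Conservative term: push down Q-values overall, push up Q of dataset actions.
    conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td_loss + alpha * conservative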
Human beings construct a perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering superior parallelism and efficiency. Replicating this capability in AI has wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from storage inefficiencies in conventional explicit signal representations. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimization for signal reconstruction from sparse inputs. Software-wise, we employ a neural field to implicitly represent signals via neural networks, which is further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive memory-based in-memory computing macro, achieving substantial improvements in energy efficiency and parallelism without compromising reconstruction quality in tasks such as 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
https://arxiv.org/abs/2404.09613
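On the software side, the Gaussian encoding plus compact MLP pattern is the standard recipe for neural fields; the sketch below shows that pattern with illustrative sizes. It does not model the resistive-memory GE/PE hardware, where the random projection would come from device stochasticity rather than a software RNG.

```python
# Random Fourier feature (Gaussian) encoder followed by a small MLP neural field.
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim=3, num_feats=128, sigma=10.0):
        super().__init__()
        # Fixed random projection; illustrative stand-in for the hardware GE.
        self.register_buffer("B", torch.randn(in_dim, num_feats) * sigma)

    def forward(self, x):                       # x: (N, in_dim) coordinates
        proj = 2 * torch.pi * x @ self.B
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

class NeuralField(nn.Module):
    def __init__(self, out_dim=1, hidden=64):
        super().__init__()
        self.enc = GaussianEncoder()
        self.mlp = nn.Sequential(
            nn.Linear(256, hidden), nn.ReLU(),   # 256 = 2 * num_feats (sin + cos)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords):
        return self.mlp(self.enc(coords))
```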
With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: this http URL.
https://arxiv.org/abs/2404.09465
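The guidance mechanism in PhyScene follows the general recipe of steering a diffusion sampler with the gradient of a differentiable cost. The sketch below is a hedged, generic version of one guided reverse step; `denoiser` and `guidance_fn` (e.g., a collision-plus-reachability cost returning a scalar) are placeholders, not PhyScene's implementation.

```python
# One reverse-diffusion step with gradient-based guidance over the predicted layout.
import torch

def guided_step(x_t, t, denoiser, guidance_fn, scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    mean, var = denoiser(x_t, t)                 # model-predicted posterior mean/variance
    cost = guidance_fn(mean)                     # scalar physics/interactivity cost
    grad = torch.autograd.grad(cost, x_t)[0]
    guided_mean = mean - scale * var * grad      # shift the mean away from violations
    return guided_mean + var.sqrt() * torch.randn_like(mean)
```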
Recent trends have shown that autonomous agents, such as Autonomous Ground Vehicles (AGVs), Unmanned Aerial Vehicles (UAVs), and mobile robots, effectively improve human productivity in solving diverse tasks. However, since these agents are typically powered by portable batteries, they require extremely low power/energy consumption to operate over a long lifespan. To address this challenge, neuromorphic computing has emerged as a promising solution, where bio-inspired Spiking Neural Networks (SNNs) use spikes from event-based cameras or data-conversion pre-processing to perform sparse computations efficiently. However, studies of SNN deployments for autonomous agents are still at an early stage, and the optimization stages for enabling efficient embodied SNN deployments have not been defined systematically. Toward this, we propose a novel framework called SNN4Agents, which consists of a set of optimization techniques for designing energy-efficient embodied SNNs targeting autonomous agent applications. SNN4Agents employs weight quantization, timestep reduction, and attention window reduction to jointly improve energy efficiency, reduce the memory footprint, and optimize processing latency while maintaining high accuracy. In the evaluation, we investigate use cases of event-based car recognition and explore the trade-offs among accuracy, latency, memory, and energy consumption. The experimental results show that our framework can maintain high accuracy (i.e., 84.12%) with 68.75% memory saving, 3.58x speed-up, and 4.03x energy-efficiency improvement compared to the state-of-the-art work on the NCARS dataset, thereby enabling energy-efficient embodied SNN deployments for autonomous agents.
https://arxiv.org/abs/2404.09331
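Two of the optimizations named in the SNN4Agents abstract are easy to picture in code: uniform weight quantization and timestep reduction. The sketch below is illustrative only; the bit-width and timestep values are not the paper's chosen operating points.

```python
# Simple uniform weight quantization and timestep truncation for an SNN.
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Uniformly quantize weights to 2**bits levels over their observed range."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    if scale == 0:
        return w.copy()
    return np.round((w - w_min) / scale) * scale + w_min

def reduce_timesteps(spike_train: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the first `keep` timesteps of a (T, ...) spike train."""
    return spike_train[:keep]
```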
Home robots are intended to make their users' lives easier. Our work assists in this goal by enabling robots to inform their users of dangerous or unsanitary anomalies in their home. Examples of these anomalies include the user leaving the milk out, forgetting to turn off the stove, or leaving poison accessible to children. To move towards enabling home robots with these abilities, we have created a new dataset, which we call SafetyDetect. The SafetyDetect dataset consists of 1000 anomalous home scenes, each of which contains unsafe or unsanitary situations for an agent to detect. Our approach utilizes large language models (LLMs) alongside both a graph representation of the scene and the relationships between the objects in the scene. Our key insight is that this connected scene graph and the object relationships it encodes enable the LLM to better reason about the scene -- especially as it relates to detecting dangerous or unsanitary situations. Our most promising approach utilizes GPT-4 and pursues a categorization technique in which object relations from the scene graph are classified as normal, dangerous, unsanitary, or dangerous for children. This method correctly identifies over 90% of anomalous scenarios in the SafetyDetect dataset. Additionally, we conduct real-world experiments on a ClearPath TurtleBot, where we generate a scene graph from visuals of the real-world scene and run our approach without modification. This setup resulted in little performance loss. The SafetyDetect dataset and code will be released to the public upon this paper's publication.
https://arxiv.org/abs/2404.08827
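The categorization strategy described above can be pictured as serializing scene-graph relations into a prompt and asking the LLM for one label per relation. The sketch below is a hypothetical version of that prompting step; `query_llm` stands in for whatever GPT-4 client is used, and the prompt wording is an assumption.

```python
# Hypothetical scene-graph-to-prompt classification, not the paper's exact prompt.
LABELS = ["normal", "dangerous", "unsanitary", "dangerous for children"]

def classify_relations(scene_graph: list[tuple[str, str, str]], query_llm) -> list[str]:
    """scene_graph: (subject, relation, object) triples, e.g. ('milk', 'on', 'counter')."""
    relations = "\n".join(f"- {s} {r} {o}" for s, r, o in scene_graph)
    prompt = (
        "For each object relation below, answer with exactly one label from "
        f"{LABELS}, one per line:\n{relations}"
    )
    reply = query_llm(prompt)
    return [line.strip() for line in reply.splitlines() if line.strip()]
```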
This paper introduces a novel zero-shot motion planning method that allows users to quickly design smooth robot motions in Cartesian space. A Bézier curve-based Cartesian plan is transformed into a joint-space trajectory by our neuro-inspired inverse kinematics (IK) method CycleIK, which we make platform-independent by scaling it to arbitrary robot designs. The motion planner is evaluated on the physical hardware of the two humanoid robots NICO and NICOL in a human-in-the-loop grasping scenario. Our method is deployed with an embodied agent that has a large language model (LLM) at its core. We generalize the embodied agent, which was introduced for NICOL, to also be embodied by NICO. The agent can execute a discrete set of physical actions and allows the user to verbally instruct various different robots. We contribute a grasping primitive to its action space that allows for precise manipulation of household objects. The new CycleIK method is compared to popular numerical IK solvers and state-of-the-art neural IK methods in simulation and is shown to be competitive with or to outperform all evaluated methods when the algorithm runtime is very short. The grasping primitive is evaluated on both the NICOL and NICO robots, with reported grasp success rates of 72% and 82%, respectively.
https://arxiv.org/abs/2404.08825
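The overall pipeline, a Cartesian Bézier plan mapped point-by-point through IK into joint space, can be sketched as below. `ik_solve` is a placeholder for CycleIK or any numerical solver and is not the paper's actual interface; the number of samples is arbitrary.

```python
# Sample a cubic Bézier curve in Cartesian space and map waypoints to joint space.
import numpy as np

def bezier(p0, p1, p2, p3, n=50):
    """Sample a cubic Bézier curve defined by four control points, each of shape (3,)."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def cartesian_to_joint_trajectory(control_points, ik_solve, seed_q):
    q = seed_q
    trajectory = []
    for pose in bezier(*control_points):
        q = ik_solve(pose, q)       # warm-start each solve with the previous joint state
        trajectory.append(q)
    return trajectory
```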
The generalization of end-to-end deep reinforcement learning (DRL) for object-goal visual navigation is a long-standing challenge, since object classes and placements vary in new test environments. Learning domain-independent visual representation is critical for enabling the trained DRL agent with the ability to generalize to unseen scenes and objects. In this letter, a target-directed attention network (TDANet) is proposed to learn the end-to-end object-goal visual navigation policy with zero-shot ability. TDANet features a novel target attention (TA) module that learns both the spatial and semantic relationships among objects to help TDANet focus on the observed objects most relevant to the target. With the Siamese architecture (SA) design, TDANet distinguishes the difference between the current and target states and generates the domain-independent visual representation. To evaluate the navigation performance of TDANet, extensive experiments are conducted in the AI2-THOR embodied AI environment. The simulation results demonstrate a strong generalization ability of TDANet to unseen scenes and target objects, with higher navigation success rate (SR) and success weighted by path length (SPL) than other state-of-the-art models.
https://arxiv.org/abs/2404.08353
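A target attention module of the kind described above can be pictured as cross-attention in which the target embedding is the query and detected-object features are keys and values. The sketch below is a hedged, generic version with illustrative dimensions, not TDANet's actual architecture.

```python
# Target-conditioned attention over per-object features.
import torch
import torch.nn as nn

class TargetAttention(nn.Module):
    def __init__(self, obj_dim=256, tgt_dim=256, d=128):
        super().__init__()
        self.q = nn.Linear(tgt_dim, d)
        self.k = nn.Linear(obj_dim, d)
        self.v = nn.Linear(obj_dim, d)

    def forward(self, target_emb, obj_feats):
        # target_emb: (B, tgt_dim); obj_feats: (B, N, obj_dim)
        q = self.q(target_emb).unsqueeze(1)                    # (B, 1, d)
        k, v = self.k(obj_feats), self.v(obj_feats)            # (B, N, d)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                           # (B, d) fused feature
```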
Our goal is to build embodied agents that can learn inductively generalizable spatial concepts in a continual manner, e.g., constructing a tower of a given height. Existing work suffers from certain limitations: (a) (Liang et al., 2023) and its multi-modal extensions rely heavily on prior knowledge and are not grounded in the demonstrations; (b) (Liu et al., 2023) lacks the ability to generalize due to its purely neural approach. A key challenge is to achieve a fine balance between symbolic representations, which have the capability to generalize, and neural representations, which are physically grounded. In response, we propose a neuro-symbolic approach that expresses inductive concepts as symbolic compositions over grounded neural concepts. Our key insight is to decompose the concept learning problem into the following steps: 1) Sketch: obtain a programmatic representation for the given instruction; 2) Plan: perform model-based RL over the sequence of grounded neural action concepts to learn a grounded plan; 3) Generalize: abstract out a generic (lifted) Python program to facilitate generalizability. Continual learning is achieved by interspersing the learning of grounded neural concepts with higher-level symbolic constructs. Our experiments demonstrate that our approach significantly outperforms existing baselines in its ability to learn novel concepts and generalize inductively.
https://arxiv.org/abs/2404.07774
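To make the "lifted Python program" idea concrete, the example below shows what an abstracted tower-building program might look like once the instruction-specific constants are lifted to parameters. The primitives `pick` and `place_on` are hypothetical grounded neural action concepts, not the paper's actual ones.

```python
# An illustrative lifted program for the tower-building concept.
def build_tower(height: int, blocks: list, pick, place_on):
    """Construct a tower of the given height by stacking blocks one at a time."""
    base = blocks[0]
    pick(base)
    place_on(base, target=None)          # place the first block on the table
    for i in range(1, height):
        pick(blocks[i])
        place_on(blocks[i], target=blocks[i - 1])
    return blocks[:height]
```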
Language models trained on internet-scale data sets have shown an impressive ability to solve problems in Natural Language Processing and Computer Vision. However, experience is showing that these models are frequently brittle in unexpected ways, and require significant scaffolding to ensure that they operate correctly in the larger systems that comprise "language-model agents." In this paper, we argue that behavior trees provide a unifying framework for combining language models with classical AI and traditional programming. We introduce Dendron, a Python library for programming language model agents using behavior trees. We demonstrate the approach embodied by Dendron in three case studies: building a chat agent, a camera-based infrastructure inspection agent for use on a mobile robot or vehicle, and an agent that has been built to satisfy safety constraints that it did not receive through instruction tuning or RLHF.
https://arxiv.org/abs/2404.07439
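The behavior-tree framing can be illustrated with a tiny generic tree: a sequence node that only lets an LLM-generated reply through if a deterministic safety check succeeds. This is a generic illustration of the pattern, not Dendron's actual API.

```python
# Generic behavior-tree sketch: LLM node gated by a deterministic safety check.
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2

class Sequence:
    def __init__(self, children):
        self.children = children

    def tick(self, blackboard: dict) -> Status:
        for child in self.children:
            if child.tick(blackboard) is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS

class LLMReply:
    def __init__(self, call_llm):
        self.call_llm = call_llm

    def tick(self, bb):
        bb["reply"] = self.call_llm(bb["user_input"])
        return Status.SUCCESS

class SafetyCheck:
    def __init__(self, banned_words):
        self.banned_words = banned_words

    def tick(self, bb):
        ok = not any(word in bb["reply"].lower() for word in self.banned_words)
        return Status.SUCCESS if ok else Status.FAILURE
```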
There is a growing interest in applying large language models (LLMs) in robotic tasks, due to their remarkable reasoning ability and extensive knowledge learned from vast training corpora. Grounding LLMs in the physical world remains an open challenge as they can only process textual input. Recent advancements in large vision-language models (LVLMs) have enabled a more comprehensive understanding of the physical world by incorporating visual input, which provides richer contextual information than language alone. In this work, we proposed a novel paradigm that leveraged GPT-4V(ision), the state-of-the-art LVLM by OpenAI, to enable embodied agents to perceive liquid objects via image-based environmental feedback. Specifically, we exploited the physical understanding of GPT-4V to interpret the visual representation (e.g., time-series plot) of non-visual feedback (e.g., F/T sensor data), indirectly enabling multimodal perception beyond vision and language using images as proxies. We evaluated our method using 10 common household liquids with containers of various geometry and material. Without any training or fine-tuning, we demonstrated that our method can enable the robot to indirectly perceive the physical response of liquids and estimate their viscosity. We also showed that by jointly reasoning over the visual and physical attributes learned through interactions, our method could recognize liquid objects in the absence of strong visual cues (e.g., container labels with legible text or symbols), increasing the accuracy from 69.0% -- achieved by the best-performing vision-only variant -- to 86.0%.
https://arxiv.org/abs/2404.06904
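The "image as proxy" idea, rendering non-visual sensor readings as a plot that a vision-language model can read, is easy to sketch. The snippet below renders force/torque readings with matplotlib and pairs the image with a question; `ask_vlm` and the question wording are assumptions, not the paper's prompts.

```python
# Render an F/T time series as an image and query a VLM about the liquid.
import matplotlib.pyplot as plt

def ft_plot_to_image(timestamps, forces, path="ft_plot.png"):
    """Save a force-vs-time plot that a VLM can read as an image."""
    fig, ax = plt.subplots()
    ax.plot(timestamps, forces)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("force (N)")
    fig.savefig(path)
    plt.close(fig)
    return path

def estimate_viscosity(timestamps, forces, ask_vlm):
    image_path = ft_plot_to_image(timestamps, forces)
    question = ("The plot shows the force on a container while it is shaken. "
                "Does the damping pattern suggest a low- or high-viscosity liquid?")
    return ask_vlm(image_path, question)
```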
The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images. However, these navigation models often handle only a single input modality as the target. With the progress achieved so far, it is time to move towards universal navigation models capable of handling various goal types, enabling more effective user interaction with robots. To facilitate this goal, we propose GOAT-Bench, a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image in an open-vocabulary fashion. We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities, the role of explicit and implicit scene memories, their robustness to noise in goal specifications, and the impact of memory in lifelong scenarios.
https://arxiv.org/abs/2404.06609
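The open-vocabulary goal specification in GOAT, where each sub-goal is a category name, a language description, or an image, can be captured with a small data structure like the one below. Field names and the example episode are assumptions for illustration, not the benchmark's actual schema.

```python
# Illustrative goal specification for a GOAT-style multi-goal episode.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GoalSpec:
    modality: str                      # "category" | "language" | "image"
    category: Optional[str] = None     # e.g. "sink"
    description: Optional[str] = None  # e.g. "the plant on the wooden shelf"
    image_path: Optional[str] = None   # path to a goal image

episode = [
    GoalSpec("category", category="sink"),
    GoalSpec("language", description="the plant on the wooden shelf"),
    GoalSpec("image", image_path="goal_0021.png"),
]
```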
In the field of visual affordance learning, previous methods have mainly used abundant images or videos that delineate human behavior patterns to identify action-possibility regions for object manipulation, with a variety of applications in robotic tasks. However, they face a central challenge of action ambiguity, illustrated by vagueness such as whether to beat or carry a drum, and by the complexity of processing intricate scenes. Moreover, timely human intervention is important for rectifying robot errors. To address these issues, we introduce Self-Explainable Affordance learning (SEA) with embodied captions. This innovation enables robots to articulate their intentions and bridges the gap between explainable vision-language captioning and visual affordance learning. Due to the lack of an appropriate dataset, we unveil a pioneering dataset and metrics tailored to this task, which integrate images, heatmaps, and embodied captions. Furthermore, we propose a novel model that effectively combines affordance grounding with self-explanation in a simple but efficient manner. Extensive quantitative and qualitative experiments demonstrate our method's effectiveness.
https://arxiv.org/abs/2404.05603
Prominent large language models have exhibited human-level performance in many domains, even enabling the derived agents to simulate human and social interactions. While practical works have substantiated the practicability of grounding language agents in sandbox simulations or embodied simulators, current social intelligence benchmarks either stay at the language level or use subjective metrics. In pursuit of a more realistic and objective evaluation, we introduce the Social Tasks in Sandbox Simulation (STSS) benchmark, which assesses language agents objectively at the action level by scrutinizing goal achievement within the multi-agent simulation. Additionally, we sample conversation scenarios to build a language-level benchmark that provides an economical preliminary evaluation and aligns with prevailing benchmarks. To gauge the significance of agent architecture, we implement a target-driven planning (TDP) module as an adjunct to the existing agent. Our evaluation highlights that the STSS benchmark is challenging for state-of-the-art language agents. Furthermore, it effectively discriminates between distinct language agents, suggesting its usefulness as a benchmark for evaluating both language models and agent architectures.
https://arxiv.org/abs/2404.05337
Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence in various domains. This inspires researchers to train end-to-end MLLMs or to utilize large models with human-selected prompts to generate policies for embodied agents. However, these methods exhibit limited generalization capabilities on unseen tasks or scenarios, and overlook the multimodal environment information that is critical for robots to make decisions. In this paper, we introduce a novel Robotic Multimodal Perception-Planning (RoboMP$^2$) framework for robotic manipulation, which consists of a Goal-Conditioned Multimodal Preceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). Specifically, GCMP captures environment states by employing a tailored MLLM for embodied agents with semantic reasoning and localization abilities. RAMP utilizes a coarse-to-fine retrieval method to find the $k$ most relevant policies as in-context demonstrations to enhance the planner. Extensive experiments demonstrate the superiority of RoboMP$^2$ on both the VIMA benchmark and real-world tasks, with around 10% improvement over the baselines.
https://arxiv.org/abs/2404.04929
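The retrieval-augmented planning step can be pictured as embedding the current instruction, retrieving the k most similar stored policies, and splicing them into the planner prompt as in-context demonstrations. The sketch below is a simplified, single-stage version; the embedding function, prompt format, and similarity measure are assumptions rather than RAMP's coarse-to-fine procedure.

```python
# Top-k demonstration retrieval by cosine similarity, then prompt assembly.
import numpy as np

def top_k_demonstrations(query_emb, demo_embs, demos, k=3):
    """demo_embs: (N, d) array of demo embeddings; demos: list of N demo strings."""
    sims = demo_embs @ query_emb / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    idx = np.argsort(-sims)[:k]
    return [demos[i] for i in idx]

def build_planner_prompt(instruction, query_emb, demo_embs, demos, k=3):
    examples = "\n\n".join(top_k_demonstrations(query_emb, demo_embs, demos, k))
    return f"{examples}\n\nInstruction: {instruction}\nPlan:"
```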
With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models (MLMs) integrate multi-modal signals into LLMs, further bringing richer perception to entity agents and allowing embodied agents to perceive world-understanding tasks more delicately. However, existing works: 1) operate agents independently, each containing multiple LLMs from perception to action, resulting in gaps between complex tasks and execution; 2) train MLMs on static data, struggling with dynamics in open-ended scenarios; 3) input prior knowledge directly as prompts, suppressing application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 on open-ended tasks, with $1.4\times$ to $7.3\times$ improvements in performance.
https://arxiv.org/abs/2404.04619
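For orientation, the snippet below shows the standard knowledge-distillation objective on which such frameworks build: the student matches softened teacher outputs in addition to the task loss. This is a generic sketch, not STEVE-2's specific hierarchical, mirrored-distillation recipe.

```python
# Standard distillation loss: softened teacher targets plus hard-label loss.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, w=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return w * soft + (1 - w) * hard
```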
We present an embodied AI system that receives open-ended natural language instructions from a human and controls two arms to collaboratively accomplish potentially long-horizon tasks over a large workspace. Our system is modular: it deploys state-of-the-art Large Language Models for task planning, Vision-Language models for semantic perception, and Point Cloud transformers for grasping. With semantic and physical safety in mind, these modules are interfaced with a real-time trajectory optimizer and a compliant tracking controller to enable human-robot proximity. We demonstrate performance on the following tasks: bi-arm sorting, bottle opening, and trash disposal. These are done zero-shot: the models used have not been trained with any real-world data from this bi-arm robot, its scenes, or its workspace. Composing both learning- and non-learning-based components in a modular fashion, with interpretable inputs and outputs, allows the user to easily debug points of failure and fragility. One may also swap modules in place to improve the robustness of the overall platform, for instance with imitation-learned policies.
https://arxiv.org/abs/2404.03570
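The modular composition described above can be sketched as a pipeline in which each stage exposes a small, interpretable interface, so failures can be localized and modules swapped. The names and interfaces below are illustrative assumptions, not the system's actual code.

```python
# Schematic modular pipeline: plan -> perceive -> grasp -> execute.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    point_cloud: object   # per-object points from the perception module

def run_pipeline(instruction, llm_planner, vlm_perceive, grasp_planner, execute):
    plan = llm_planner(instruction)                 # list of (action, object_name) steps
    for action, obj_name in plan:
        detections = vlm_perceive()                 # semantic perception
        target = next(d for d in detections if d.label == obj_name)
        grasp = grasp_planner(target.point_cloud)   # point-cloud-based grasp proposal
        execute(action, grasp)                      # trajectory optimizer + controller
```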
Robotic technologies have been an indispensable part of improving human productivity, as they help humans complete diverse, complex, and intensive tasks in a fast yet accurate and efficient way. Therefore, robotic technologies have been deployed in a wide range of applications, ranging from personal to industrial use-cases. However, current robotic technologies and their computing paradigm still lack embodied intelligence to efficiently interact with operational environments, respond with correct/expected actions, and adapt to changes in the environments. Toward this, recent advances in neuromorphic computing with Spiking Neural Networks (SNNs) have demonstrated the potential to enable embodied intelligence for robotics through a bio-plausible computing paradigm that mimics how the biological brain works, known as "neuromorphic artificial intelligence (AI)". However, the field of neuromorphic AI-based robotics is still at an early stage, therefore its development and deployment for solving real-world problems expose new challenges in different design aspects, such as accuracy, adaptability, efficiency, reliability, and security. To address these challenges, this paper discusses how we can enable embodied neuromorphic AI for robotic systems through our perspectives: (P1) Embodied intelligence based on effective learning rules, training mechanisms, and adaptability; (P2) Cross-layer optimizations for energy-efficient neuromorphic computing; (P3) Representative and fair benchmarks; (P4) Low-cost reliability and safety enhancements; (P5) Security and privacy for neuromorphic computing; and (P6) A synergistic development for energy-efficient and robust neuromorphic-based robotics. Furthermore, this paper identifies research challenges and opportunities, and elaborates our vision for future research development toward embodied neuromorphic AI for robotics.
https://arxiv.org/abs/2404.03325
This paper explores the integration of linguistic inputs within robotic navigation systems, drawing upon the symbol interdependency hypothesis to bridge the divide between symbolic and embodied cognition. It examines previous work incorporating language and semantics into Neural Network (NN) and Simultaneous Localization and Mapping (SLAM) approaches, highlighting how these integrations have advanced the field. By contrasting abstract symbol manipulation with sensory-motor grounding, we propose a unified framework where language functions both as an abstract communicative system and as a grounded representation of perceptual experiences. Our review of cognitive models of distributional semantics and their application to autonomous agents underscores the transformative potential of language-integrated systems.
https://arxiv.org/abs/2404.03049
Recent advancements in language models have demonstrated their adeptness at conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly human motion models. By integrating multi-turn conversations into the control of continuous virtual human movements, generative human motion models can achieve an intuitive, step-by-step process of human task execution for humanoid robotics, game agents, or other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous and long-term human motion through multimodal prompts. Specifically, MotionChain consists of multi-modal tokenizers that transform various data types, such as text, image, and motion, into discrete tokens, coupled with a Vision-Motion-aware Language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive ways of controlling and interacting with virtual humans.
https://arxiv.org/abs/2404.01700
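Turning continuous motion features into discrete tokens is the usual vector-quantization recipe that multi-modal tokenizers rely on: nearest-neighbour lookup in a learned codebook. The sketch below shows that lookup with illustrative shapes; it is a generic sketch, not MotionChain's tokenizer.

```python
# Quantize per-frame motion features to discrete tokens via a codebook.
import torch

def motion_to_tokens(motion_feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """motion_feats: (T, d) per-frame features; codebook: (K, d) learned codes."""
    dists = torch.cdist(motion_feats, codebook)   # (T, K) pairwise distances
    return dists.argmin(dim=1)                    # (T,) discrete token ids

def tokens_to_motion(tokens: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    return codebook[tokens]                       # (T, d) quantized features
```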