Despite widespread success in various applications, large language models (LLMs) often stumble when tackling basic physical reasoning or executing robotics tasks, due to a lack of direct experience with the physical nuances of the real world. To address these issues, we propose Grounding Large language models with Imperfect world MOdels (GLIMO), which utilizes proxy world models such as simulators to collect and synthesize training data. GLIMO incorporates an LLM agent-based data generator to automatically create high-quality and diverse instruction datasets. The generator includes an iterative self-refining module for temporally consistent experience sampling, a diverse set of question-answering instruction seeds, and a retrieval-augmented generation module for reflecting on prior experiences. Comprehensive experiments show that our approach improves the performance of strong open-source LLMs such as LLaMA-3 by 2.04$\times$, 1.54$\times$, and 1.82$\times$ across three different benchmarks, respectively. The resulting models compete with or surpass their larger counterparts such as GPT-4.
https://arxiv.org/abs/2410.02742
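A minimal sketch of how such an agent-based data generator might be organized is given below, assuming a proxy simulator, an LLM call, and a retriever over past experiences; all names (sim.rollout, llm_generate, retriever.lookup, QA_INSTRUCTION_SEEDS) are hypothetical placeholders rather than GLIMO's actual interfaces.

```python
# Hypothetical sketch of an LLM-agent data generator grounded in a proxy world model.
# All names below are illustrative placeholders, not GLIMO's actual interfaces.
import random

QA_INSTRUCTION_SEEDS = [
    "Describe the physical consequence of the last action.",
    "What should the robot do next to achieve the goal?",
]

def generate_instruction_dataset(sim, llm_generate, retriever, n_episodes=100, max_refinements=3):
    dataset = []
    for _ in range(n_episodes):
        trajectory = sim.rollout()                 # sample experience from the proxy world model
        for _ in range(max_refinements):           # iterative self-refinement for temporal consistency
            critique = llm_generate(f"Check this trajectory for temporal inconsistencies:\n{trajectory}")
            if "consistent" in critique.lower():
                break
            trajectory = sim.refine(trajectory, critique)
        seed = random.choice(QA_INSTRUCTION_SEEDS)  # diversify instructions via QA seeds
        context = retriever.lookup(trajectory)      # retrieval-augmented reflection on prior experiences
        qa_pair = llm_generate(
            f"Instruction seed: {seed}\nTrajectory: {trajectory}\nRelated past experiences: {context}\n"
            "Write one instruction-response pair grounded in the trajectory."
        )
        dataset.append(qa_pair)
    return dataset
```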
Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.
https://arxiv.org/abs/2410.02730
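Since the agent is trained by imitation on BFS shortest paths with CoT explanations, the supervision can be pictured as (observation history, explanation, action) records generated along a planner path. The grid-world sketch below illustrates this assembly under simplified assumptions; the grid abstraction, action names, and record fields are illustrative, not the DivScene/NatVLM format.

```python
from collections import deque

def bfs_shortest_path(start, goal, passable):
    """Breadth-first search on a 4-connected grid; returns the list of cells from start to goal."""
    queue, parent = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = [cell]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        x, y = cell
        for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if passable(nxt) and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None

def make_imitation_records(path, target_name, observe):
    """Turn a planner path into (observation history, CoT explanation, action) training tuples."""
    records, history = [], []
    for cur, nxt in zip(path, path[1:]):
        history.append(observe(cur))                         # egocentric observation at the current cell
        action = "MoveAhead" if nxt[0] > cur[0] else "Turn"  # illustrative action mapping only
        cot = f"The {target_name} is not visible yet; following the planned route, the next step is {action}."
        records.append({"observations": list(history), "cot": cot, "action": action})
    return records
```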
The systematic evaluation of speech separation and enhancement models under moving sound source conditions typically requires extensive data comprising diverse scenarios. However, real-world datasets often contain insufficient data to meet the training and evaluation requirements of models. Although synthetic datasets offer a larger volume of data, their acoustic simulations lack realism. Consequently, neither real-world nor synthetic datasets effectively fulfill practical needs. To address these issues, we introduce SonicSim, a synthetic toolkit designed to generate highly customizable data for moving sound sources. SonicSim is developed on top of the embodied AI simulation platform Habitat-sim and supports multi-level adjustments, including scene-level, microphone-level, and source-level, thereby generating more diverse synthetic data. Leveraging SonicSim, we constructed a moving sound source benchmark dataset, SonicSet, using LibriSpeech, the Freesound Dataset 50k (FSD50K), the Free Music Archive (FMA), and 90 scenes from Matterport3D to evaluate speech separation and enhancement models. Additionally, to validate the differences between synthetic and real-world data, we randomly selected 5 hours of raw data without reverberation from the SonicSet validation set to record a real-world speech separation dataset, which was then compared with the corresponding synthetic datasets. Similarly, we utilized the real-world speech enhancement dataset RealMAN to validate the acoustic gap between other synthetic datasets and the SonicSet dataset for speech enhancement. The results indicate that the synthetic data generated by SonicSim can effectively generalize to real-world scenarios. Demo and code are publicly available at this https URL.
https://arxiv.org/abs/2410.01481
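SonicSim's multi-level controls can be pictured as a configuration covering the scene, the microphone array, and the moving sources. The dictionary below is a schematic illustration only; its key names are assumptions, not SonicSim's actual schema.

```python
# Illustrative configuration covering the three adjustment levels described above.
# Key names are assumptions for the sketch, not SonicSim's actual schema.
simulation_config = {
    "scene": {                       # scene-level: which Matterport3D environment to load
        "scene_id": "mp3d_17DRP5sb8fy",
        "materials": "default",      # surface materials affect reverberation
    },
    "microphone": {                  # microphone-level: array geometry and placement
        "array": "circular_6ch",
        "position": [1.0, 1.5, 2.0],
        "sample_rate": 16000,
    },
    "sources": [                     # source-level: moving speech/noise/music sources
        {"audio": "librispeech/1089-134686-0000.flac", "trajectory": "walk_along_corridor", "speed_mps": 1.0},
        {"audio": "fsd50k/noise_374.wav", "trajectory": "static", "speed_mps": 0.0},
    ],
}
```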
Embodied AI is a rapidly advancing field that bridges the gap between cyberspace and physical space, enabling a wide range of applications. This evolution has led to the development of the Vehicular Embodied AI NETwork (VEANET), where advanced AI capabilities are integrated into vehicular systems to enhance autonomous operations and decision-making. Embodied agents, such as Autonomous Vehicles (AVs), are autonomous entities that can perceive their environment and take actions to achieve specific goals, actively interacting with the physical world. Embodied twins are digital models of these embodied agents, and various embodied AI twins enable intelligent applications in cyberspace. In VEANET, embodied AI twins act as in-vehicle AI assistants that perform diverse tasks supporting autonomous driving using generative AI models. Due to their limited computational resources, AVs often offload computationally intensive tasks, such as constructing and updating embodied AI twins, to nearby Roadside Units (RSUs). However, because of the rapid mobility of AVs and the limited coverage of a single RSU, embodied AI twins require dynamic migration from the current RSU to other RSUs in real time, raising the challenge of selecting suitable RSUs for efficient embodied AI twin migration. Given information asymmetry, AVs cannot know the detailed information of RSUs. To this end, in this paper, we construct a multi-dimensional contract theoretical model between AVs and alternative RSUs. Considering that AVs may exhibit irrational behavior, we utilize prospect theory instead of expected utility theory to model the actual utilities of AVs. Finally, we employ a generative diffusion model-based algorithm to identify the optimal contract designs. Numerical results demonstrate the effectiveness of the proposed scheme compared with traditional deep reinforcement learning algorithms.
https://arxiv.org/abs/2410.01176
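For readers unfamiliar with the substitution of prospect theory for expected utility, the sketch below implements the standard Kahneman-Tversky value and probability-weighting functions with their commonly cited parameters; the paper's exact formulation and parameter choices may differ.

```python
def prospect_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Kahneman-Tversky value function: concave for gains, convex and steeper for losses."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

def probability_weight(p, gamma=0.61):
    """Inverse-S probability weighting: small probabilities are overweighted."""
    return p ** gamma / ((p ** gamma + (1 - p) ** gamma) ** (1 / gamma))

def prospect_utility(outcomes):
    """Prospect-theoretic utility of (probability, payoff) pairs relative to a zero reference point."""
    return sum(probability_weight(p) * prospect_value(x) for p, x in outcomes)

# Example: a risky contract (80% chance of payoff 10, 20% chance of losing 5)
# is valued well below its expected payoff of 7 under prospect theory.
print(prospect_utility([(0.8, 10.0), (0.2, -5.0)]))
```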
Simulation has enabled unprecedented compute-scalable approaches to robot learning. However, many existing simulation frameworks typically support a narrow range of scenes/tasks and lack features critical for scaling generalizable robotics and sim2real. We introduce and open source ManiSkill3, the fastest state-visual GPU-parallelized robotics simulator with contact-rich physics targeting generalizable manipulation. ManiSkill3 supports GPU parallelization of many aspects including simulation+rendering, heterogeneous simulation, point-cloud/voxel visual input, and more. Simulation with rendering on ManiSkill3 can run 10-1000x faster with 2-3x less GPU memory usage than other platforms, achieving up to 30,000+ FPS in benchmarked environments due to minimal python/pytorch overhead in the system, simulation on the GPU, and the use of the SAPIEN parallel rendering system. Tasks that used to take hours to train can now take minutes. We further provide the most comprehensive range of GPU-parallelized environments/tasks spanning 12 distinct domains including but not limited to mobile manipulation for tasks such as drawing, humanoids, and dexterous manipulation in realistic scenes designed by artists or real-world digital twins. In addition, millions of demonstration frames are provided from motion planning, RL, and teleoperation. ManiSkill3 also provides a comprehensive set of baselines that span popular RL and learning-from-demonstrations algorithms.
https://arxiv.org/abs/2410.00425
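A rough throughput-benchmark sketch in the style enabled by such GPU parallelization is shown below. It assumes a gymnasium-style interface where environments are created with a num_envs argument and actions are batched; the environment ID and keyword arguments are assumptions to be checked against the ManiSkill3 documentation.

```python
# Throughput sketch for a GPU-parallelized simulator with a gymnasium-style API.
# Environment ID and keyword arguments are assumptions; check the ManiSkill3 docs.
import time
import gymnasium as gym
import mani_skill.envs  # noqa: F401  (assumed to register ManiSkill3 environments with gymnasium)

env = gym.make("PickCube-v1", num_envs=4096, obs_mode="state")  # thousands of envs in one GPU batch
obs, _ = env.reset(seed=0)

steps, start = 1000, time.time()
for _ in range(steps):
    actions = env.action_space.sample()  # batched actions, one row per parallel environment
    obs, reward, terminated, truncated, info = env.step(actions)
elapsed = time.time() - start
print(f"~{steps * 4096 / elapsed:,.0f} env-steps per second")
env.close()
```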
Suppose you are at your desk looking at some objects on it. You don't know the precise distance from your eye to any particular object in meters. However, you can immediately reach out and touch any of them. Instead of the meter, your knowledge of distance is encoded in unknown but embodied units of action. In contrast, standard approaches in robotics assume calibration to the meter, so that separated vision and control processes can be interfaced. Consequently, robots are precisely manufactured and calibrated, resulting in expensive systems available in only a few configurations. In response, we propose Embodied Visuomotor Representation, a framework that allows distance to be measured by a robot's own actions and thus minimizes dependence on calibrated 3D sensors and physical models. Using it, we demonstrate that a robot without knowledge of its size, environmental scale, or its own strength can become capable of touching and clearing obstacles after several seconds of operation. Similarly, we demonstrate in simulation that an agent, without knowledge of its mass or strength, can jump a gap of unknown size after performing a few test oscillations. These experiments parallel bee and gerbil behavior, respectively.
https://arxiv.org/abs/2410.00287
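A classic instance of distance expressed in embodied rather than metric units is time-to-contact, which can be read from the expansion rate of an object's image without knowing its size or metric distance. The snippet below illustrates that general idea; it is not the paper's method.

```python
def time_to_contact(angular_size, angular_size_rate):
    """Lee's tau: remaining time to reach an object, from its image size and expansion rate.

    No metric calibration is needed; the result is in units of time, an action-relative quantity.
    """
    return angular_size / angular_size_rate

# An object subtending 0.05 rad and expanding at 0.01 rad/s will be reached in ~5 s
# at the current approach speed, whatever its metric size or distance.
print(time_to_contact(0.05, 0.01))
```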
How is language related to consciousness? Language functions to categorise perceptual experiences (e.g., labelling interoceptive states as 'happy') and higher-level constructs (e.g., using 'I' to represent the narrative self). Psychedelic use and meditation might be described as altered states that impair or intentionally modify the capacity for linguistic categorisation. For example, psychedelic phenomenology is often characterised by 'oceanic boundlessness' or 'unity' and 'ego dissolution', which might be expected of a system unburdened by entrenched language categories. If language breakdown plays a role in producing such altered behaviour, multimodal artificial intelligence might align more with these phenomenological descriptions when attention is shifted away from language. We tested this hypothesis by comparing the semantic embedding spaces from simulated altered states after manipulating attentional weights in CLIP and FLAVA models to embedding spaces from altered states questionnaires before manipulation. Compared to random text and various other altered states including anxiety, models were more aligned with disembodied, ego-less, spiritual, and unitive states, as well as minimal phenomenal experiences, with decreased attention to language and vision. Reduced attention to language was associated with distinct linguistic patterns and blurred embeddings within and, especially, across semantic categories (e.g., 'giraffes' become more like 'bananas'). These results lend support to the role of language categorisation in the phenomenology of altered states of consciousness, like those experienced with high doses of psychedelics or concentration meditation, states that often lead to improved mental health and wellbeing.
https://arxiv.org/abs/2410.00257
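The reported blurring of embeddings within and across semantic categories can be quantified by comparing average cosine similarities before and after the attention manipulation. The generic NumPy sketch below shows such a comparison; it is an illustrative analysis, not the authors' exact pipeline.

```python
import numpy as np

def mean_cosine(a, b):
    """Mean pairwise cosine similarity between two sets of row-vector embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

def category_blurring(embeddings_by_category):
    """Within- vs. across-category similarity; blurring shrinks the gap between the two."""
    cats = list(embeddings_by_category)
    within = np.mean([mean_cosine(embeddings_by_category[c], embeddings_by_category[c]) for c in cats])
    across = np.mean([mean_cosine(embeddings_by_category[c1], embeddings_by_category[c2])
                      for i, c1 in enumerate(cats) for c2 in cats[i + 1:]])
    return within, across

# Compare the (within - across) gap before and after reducing attention to language:
# a smaller gap means 'giraffes' drift toward 'bananas'.
```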
Grounding 3D scene affordance aims to locate interactive regions in 3D environments, which is crucial for embodied agents to interact intelligently with their surroundings. Most existing approaches achieve this by mapping semantics to 3D instances based on static geometric structure and visual appearance. This passive strategy limits the agent's ability to actively perceive and engage with the environment, making it reliant on predefined semantic instructions. In contrast, humans develop complex interaction skills by observing and imitating how others interact with their surroundings. To empower the model with such abilities, we introduce a novel task: grounding 3D scene affordance from egocentric interactions, where the goal is to identify the corresponding affordance regions in a 3D scene based on an egocentric video of an interaction. This task faces the challenges of spatial complexity and alignment complexity across multiple sources. To address these challenges, we propose the Egocentric Interaction-driven 3D Scene Affordance Grounding (Ego-SAG) framework, which utilizes interaction intent to guide the model in focusing on interaction-relevant sub-regions and aligns affordance features from different sources through a bidirectional query decoder mechanism. Furthermore, we introduce the Egocentric Video-3D Scene Affordance Dataset (VSAD), covering a wide range of common interaction types and diverse 3D environments to support this task. Extensive experiments on VSAD validate both the feasibility of the proposed task and the effectiveness of our approach.
https://arxiv.org/abs/2409.19650
Humans can perform complex tasks with long-term objectives by planning, reasoning, and forecasting outcomes of actions. For embodied agents to achieve similar capabilities, they must gain knowledge of the environment transferable to novel scenarios with a limited budget of additional trial and error. Learning-based approaches, such as deep RL, can discover and take advantage of inherent regularities and characteristics of the application domain from data, and continuously improve their performance, albeit at the cost of large amounts of training data. This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks, focusing on enhancing learning efficiency, interpretability, and transferability across novel scenarios. Four key contributions are made. 1) CALVIN, a differentiable planner that learns interpretable models of the world for long-term planning. It successfully navigated partially observable 3D environments, such as mazes and indoor rooms, by learning the rewards and state transitions from expert demonstrations. 2) SOAP, an RL algorithm that discovers options unsupervised for long-horizon tasks. Options segment a task into subtasks and enable consistent execution of each subtask. SOAP showed robust performance on history-conditional corridor tasks as well as classical benchmarks such as Atari. 3) LangProp, a code optimisation framework using LLMs to solve embodied agent problems that require reasoning, by treating code as learnable policies. The framework successfully generated interpretable code with performance comparable or superior to human-written experts in the CARLA autonomous driving benchmark. 4) Voggite, an embodied agent with a vision-to-action transformer backend that solves complex tasks in Minecraft. It achieved third place in the MineRL BASALT Competition by identifying action triggers to segment tasks into multiple stages.
https://arxiv.org/abs/2409.19479
In this paper, we develop an embodied AI system for human-in-the-loop navigation with a wheeled mobile robot. We propose a direct yet effective method of monitoring the robot's current plan to detect changes in the environment that impact the intended trajectory of the robot significantly and then query a human for feedback. We also develop a means to parse human feedback expressed in natural language into local navigation waypoints and integrate it into a global planning system, by leveraging a map of semantic features and an aligned obstacle map. Extensive testing in simulation and physical hardware experiments with a resource-constrained wheeled robot tasked to navigate in a real-world environment validate the efficacy and robustness of our method. This work can support applications like precision agriculture and construction, where persistent monitoring of the environment provides a human with information about the environment state.
https://arxiv.org/abs/2409.19459
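Parsing a natural-language correction into a local waypoint against a semantic map could look roughly like the sketch below; the phrase patterns, fixed offsets, and map format are illustrative assumptions, not the paper's parser.

```python
import re

# Hypothetical semantic map: landmark name -> (x, y) position in the robot's planning frame.
SEMANTIC_MAP = {"tree": (4.0, 2.0), "barn": (10.0, -1.0)}
OFFSETS = {"left of": (0.0, 1.5), "right of": (0.0, -1.5), "behind": (-1.5, 0.0), "in front of": (1.5, 0.0)}

def feedback_to_waypoint(feedback, semantic_map=SEMANTIC_MAP):
    """Turn a phrase like 'go left of the tree' into a local waypoint near a known landmark."""
    for relation, (dx, dy) in OFFSETS.items():
        match = re.search(rf"{relation} (?:the )?(\w+)", feedback)
        if match and match.group(1) in semantic_map:
            x, y = semantic_map[match.group(1)]
            return (x + dx, y + dy)  # candidate waypoint; a global planner would still check the obstacle map
    return None

print(feedback_to_waypoint("please go left of the tree"))  # -> (4.0, 3.5)
```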
Visual Place Recognition (VPR) is a crucial component of many visual localization pipelines for embodied agents. VPR is often formulated as an image retrieval task aimed at jointly learning local features and an aggregation method. The current state-of-the-art VPR methods rely on VLAD aggregation, which can be trained to learn a weighted contribution of features through their soft assignment to cluster centers. However, this process has two key limitations. Firstly, the feature-to-cluster weighting does not account for over-represented repetitive structures within a cluster, e.g., shadows or window panes; this phenomenon is also referred to as the `burstiness' problem, classically solved by discounting repetitive features before aggregation. Secondly, feature to cluster comparisons are compute-intensive for state-of-the-art image encoders with high-dimensional local features. This paper addresses these limitations by introducing VLAD-BuFF with two novel contributions: i) a self-similarity based feature discounting mechanism to learn Burst-aware features within end-to-end VPR training, and ii) Fast Feature aggregation by reducing local feature dimensions specifically through PCA-initialized learnable pre-projection. We benchmark our method on 9 public datasets, where VLAD-BuFF sets a new state of the art. Our method is able to maintain its high recall even for 12x reduced local feature dimensions, thus enabling fast feature aggregation without compromising on recall. Through additional qualitative studies, we show how our proposed weighting method effectively downweights the non-distinctive features. Source code: this https URL.
https://arxiv.org/abs/2409.19293
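For reference, the NumPy sketch below combines the two ideas named above: soft-assignment VLAD aggregation with a self-similarity-based discount on repetitive (bursty) features, applied after a low-dimensional pre-projection (which VLAD-BuFF initializes from PCA and then learns end-to-end). The particular discounting formula is an illustrative assumption, not the paper's exact mechanism.

```python
import numpy as np

def soft_vlad_with_burst_discount(feats, centers, proj, tau=0.9):
    """feats: (N, D) local features; centers: (K, d) cluster centers; proj: (D, d) pre-projection.

    Returns a flattened (K*d,) VLAD descriptor with burst-aware feature weights.
    """
    x = feats @ proj                                       # reduce local feature dimension (PCA-initialized in VLAD-BuFF)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)

    sims = x @ x.T                                         # self-similarity among local features
    burst_count = (sims > tau).sum(axis=1).astype(float)   # how many near-duplicates each feature has
    weights = 1.0 / np.sqrt(burst_count)                   # discount repetitive ("bursty") features

    logits = -np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1) ** 2
    assign = np.exp(logits - logits.max(axis=1, keepdims=True))
    assign = assign / assign.sum(axis=1, keepdims=True)    # soft assignment of each feature to clusters

    residuals = x[:, None, :] - centers[None, :, :]        # (N, K, d) residuals to each center
    vlad = (weights[:, None, None] * assign[:, :, None] * residuals).sum(axis=0)
    vlad = vlad / (np.linalg.norm(vlad) + 1e-12)           # global L2 normalization
    return vlad.reshape(-1)
```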
In recent years, Embodied Artificial Intelligence (Embodied AI) has advanced rapidly, yet the increasing size of models conflicts with the limited computational capabilities of Embodied AI platforms. To address this challenge, we aim to achieve both high model performance and practical deployability. Specifically, we focus on Vision-and-Language Navigation (VLN), a core task in Embodied AI. This paper introduces a two-stage knowledge distillation framework, producing a student model, MiniVLN, and showcasing the significant potential of distillation techniques in developing lightweight models. The proposed method aims to capture fine-grained knowledge during the pretraining phase and navigation-specific knowledge during the fine-tuning phase. Our findings indicate that the two-stage distillation approach is more effective in narrowing the performance gap between the teacher model and the student model compared to single-stage distillation. On the public R2R and REVERIE benchmarks, MiniVLN achieves performance on par with the teacher model while having only about 12% of the teacher model's parameter count.
https://arxiv.org/abs/2409.18800
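For context, most distillation pipelines, including two-stage ones, build on the standard logit-distillation objective that blends hard-label cross-entropy with a temperature-scaled KL term toward the teacher, as sketched below; MiniVLN's navigation-specific targets are not modeled here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Hard-label cross-entropy blended with temperature-scaled KL to the teacher's soft targets."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradient magnitudes stay comparable across temperatures
    return alpha * soft + (1.0 - alpha) * hard
```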
There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval-augmented generation (RAG) has become the workhorse of large-scale non-parametric knowledge; however, existing techniques do not transfer directly to the embodied domain, where data are multimodal and highly correlated and perception requires abstraction. To address these challenges, we introduce Embodied-RAG, a framework that enhances the foundational model of an embodied agent with a non-parametric memory system capable of autonomously constructing hierarchical knowledge for both navigation and language generation. Embodied-RAG handles a full range of spatial and semantic resolutions across diverse environments and query types, whether for a specific object or a holistic description of ambiance. At its core, Embodied-RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail. This hierarchical organization allows the system to efficiently generate context-sensitive outputs across different robotic platforms. We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 200 explanation and navigation queries across 19 environments, highlighting its promise as a general-purpose non-parametric system for embodied agents.
https://arxiv.org/abs/2409.18313
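Retrieval over a semantic forest can be pictured as a greedy coarse-to-fine descent through nodes holding language descriptions at increasing detail. The sketch below shows one such traversal; the node structure and scoring are assumptions for illustration, not Embodied-RAG's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticNode:
    """One node of the semantic forest: a language description plus finer-grained children."""
    description: str
    embedding: list
    children: list = field(default_factory=list)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-12)

def retrieve(root, query_embedding):
    """Greedy coarse-to-fine descent: follow the most query-similar child until a leaf is reached."""
    node, trail = root, [root.description]
    while node.children:
        node = max(node.children, key=lambda c: cosine(c.embedding, query_embedding))
        trail.append(node.description)
    return trail  # coarse-to-fine context handed to the language generator or navigation planner
```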
For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.
https://arxiv.org/abs/2409.18073
Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the "whole" image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: "the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap". We address this by encoding and searching for "image segments" instead of the whole images. We propose to use open-set image segmentation to decompose an image into `meaningful' entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole image based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to ``revisit anything'' by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: this https URL.
https://arxiv.org/abs/2409.18049
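A SuperSegment is described as a segment joined with its neighbouring segments. One way such units might be formed and encoded from a segment adjacency graph and per-segment features is sketched below; the aggregation choice is an illustrative assumption, not SegVLAD's exact encoder.

```python
import numpy as np

def build_supersegments(adjacency, segment_features):
    """adjacency: dict seg_id -> set of neighbouring seg_ids; segment_features: dict seg_id -> (d,) feature.

    Each SuperSegment is a segment plus its neighbours; features are summed and L2-normalized,
    and each SuperSegment is indexed separately so retrieval can match on partial overlap between views.
    """
    descriptors = {}
    for seg, neighbours in adjacency.items():
        members = [seg] + sorted(neighbours)
        agg = np.sum([segment_features[m] for m in members], axis=0)
        descriptors[seg] = agg / (np.linalg.norm(agg) + 1e-12)
    return descriptors
```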
This paper presents a novel approach to multi-robot planning and collaboration. We demonstrate a cognitive strategy for robots in human-robot teams that incorporates metacognition, natural language communication, and explainability. The system is embodied using the HARMONIC architecture that flexibly integrates cognitive and control capabilities across the team. We evaluate our approach through simulation experiments involving a joint search task by a team of heterogeneous robots (a UGV and a drone) and a human. We detail the system's handling of complex, real-world scenarios, effective action coordination between robots with different capabilities, and natural human-robot communication. This work demonstrates that the robots' ability to reason about plans, goals, and attitudes, and to provide explanations for actions and decisions are essential prerequisites for realistic human-robot teaming.
https://arxiv.org/abs/2409.18047
This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: to manipulate a rigid-body environment in a context where the robot is unaware of a concept that's key to solving the instructed task. We propose SECURE, an interactive task learning framework designed to solve such problems by fixing a deficient domain model using embodied conversation. Through dialogue, the robot discovers and then learns to exploit unforeseen possibilities. Using SECURE, the robot not only learns from the user's corrective feedback when it makes a mistake, but it also learns to make strategic dialogue decisions for revealing useful evidence about novel concepts for solving the instructed task. Together, these abilities allow the robot to generalise to subsequent tasks using newly acquired knowledge. We demonstrate that a robot that is semantics-aware -- that is, it exploits the logical consequences of both sentence and discourse semantics in the learning and inference process -- learns to solve rearrangement under unawareness more effectively than a robot that lacks such capabilities.
https://arxiv.org/abs/2409.17755
Adversarial training has achieved remarkable advancements in defending against adversarial attacks. Among them, fast adversarial training (FAT) is gaining attention for its ability to achieve competitive robustness with fewer computing resources. Existing FAT methods typically employ a uniform strategy that optimizes all training data equally without considering the influence of different examples, which leads to an imbalanced optimization. However, this imbalance remains unexplored in the field of FAT. In this paper, we conduct a comprehensive study of the imbalance issue in FAT and observe an obvious class disparity in performance. This disparity can be viewed through the lens of the alignment between clean and robust accuracy. Based on the analysis, we mainly attribute the observed misalignment and disparity to the imbalanced optimization in FAT, which motivates us to optimize different training data adaptively to enhance robustness. Specifically, we take disparity and misalignment into consideration. First, we introduce self-knowledge guided regularization, which assigns differentiated regularization weights to each class based on its training state, alleviating class disparity. Additionally, we propose self-knowledge guided label relaxation, which adjusts label relaxation according to the training accuracy, alleviating the misalignment and improving robustness. By combining these methods, we formulate Self-Knowledge Guided FAT (SKG-FAT), leveraging naturally generated knowledge during training to enhance adversarial robustness without compromising training efficiency. Extensive experiments on four standard datasets demonstrate that SKG-FAT improves robustness and preserves competitive clean accuracy, outperforming state-of-the-art methods.
https://arxiv.org/abs/2409.17589
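The two self-knowledge-guided ingredients could be realized roughly as below: per-class regularization weights derived from per-class training accuracy, and label relaxation that scales with overall training accuracy. The functional forms and directions of adjustment are assumptions for illustration, not the paper's equations.

```python
import torch

def per_class_regularization_weights(per_class_accuracy, base=1.0, scale=1.0):
    """Assumed form: classes with lower training accuracy receive larger regularization weights
    to reduce class disparity. per_class_accuracy: (num_classes,) tensor in [0, 1]."""
    return base + scale * (1.0 - per_class_accuracy)

def relaxed_labels(labels, num_classes, train_accuracy, max_smoothing=0.2):
    """Assumed form: label relaxation grows with training accuracy, softening targets once the
    model fits the clean data, to keep clean and robust objectives aligned."""
    eps = max_smoothing * train_accuracy
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes
```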
The fusion of Large Language Models (LLMs) and robotic systems has led to a transformative paradigm in the robotic field, offering unparalleled capabilities not only in the communication domain but also in skills like multimodal input handling, high-level reasoning, and plan generation. The grounding of LLMs' knowledge in the empirical world has been considered a crucial pathway to exploit the efficiency of LLMs in robotics. Nevertheless, connecting LLMs' representations to the external world with multimodal approaches or with robots' bodies is not enough to let them understand the meaning of the language they are manipulating. Taking inspiration from humans, this work draws attention to three necessary elements for an agent to grasp and experience the world. The roadmap for grounding LLMs is envisaged as an active bodily system serving as the reference point for experiencing the environment, a temporally structured experience for coherent, self-related interaction with the external world, and social skills to acquire a commonly grounded shared experience.
https://arxiv.org/abs/2409.16900
Japan faces many challenges related to its aging society, including increasing rates of cognitive decline in the population and a shortage of caregivers. Efforts have begun to explore solutions using artificial intelligence (AI), especially socially embodied intelligent agents and robots that can communicate with people. Yet, there has been little research on the compatibility of these agents with older adults in various everyday situations. To this end, we conducted a user study to evaluate a robot that functions as a facilitator for a group conversation protocol designed to prevent cognitive decline. We modified the robot to use backchannelling, a natural human way of speaking, to increase receptiveness of the robot and enjoyment of the group conversation experience. We conducted a cross-generational study with young adults and older adults. Qualitative analyses indicated that younger adults perceived the backchannelling version of the robot as kinder, more trustworthy, and more acceptable than the non-backchannelling robot. Finally, we found that the robot's backchannelling elicited nonverbal backchanneling in older participants.
https://arxiv.org/abs/2409.16899