We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents' long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To enable this capability, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the challenges posed by our benchmark and provide insights into training strategies and model behaviors under long-horizon conditions. Our work provides a foundation for the next generation of embodied AI systems capable of robust, long-term reasoning and planning.
https://arxiv.org/abs/2505.16928
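To make the interleaved Goal-State-Action modeling mentioned in the $\infty$-THOR abstract concrete, here is a minimal sketch of how a long-horizon trajectory could be serialized into an interleaved sequence for an LLM-based agent. The `<goal>`/`<state>`/`<action>` tags, the `Step` container, and the formatting are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch only: serialize a long-horizon trajectory as an
# interleaved Goal-State-Action sequence for an LLM-based agent.
# The tags, the Step container, and the formatting are assumed, not the
# paper's actual schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    state: str   # textual summary of the current observation
    action: str  # ground-truth action taken at this step

def build_gsa_sequence(goal: str, trajectory: List[Step]) -> str:
    """Interleave the goal with per-step states and actions so the agent
    conditions each action on the full long-horizon context."""
    parts = [f"<goal> {goal}"]
    for t, step in enumerate(trajectory):
        parts.append(f"<state t={t}> {step.state}")
        parts.append(f"<action t={t}> {step.action}")
    return "\n".join(parts)

if __name__ == "__main__":
    traj = [
        Step("agent at kitchen entrance; mug visible on counter", "MoveAhead"),
        Step("mug within reach", "PickupObject(mug)"),
    ]
    print(build_gsa_sequence("Put the mug in the sink", traj))
```

Over hundreds of environment steps such a sequence grows far beyond typical context windows, which is where the context extension techniques and Context Parallelism mentioned in the abstract come in.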
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various real-world distortions limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method that assesses the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) construct a perception-cognition-decision-execution pipeline based on the Mertonian system and meta-cognitive theory, and define a comprehensive subjective score collection process; (2) establish the Embodied-IQA database, containing over 36k reference/distorted image pairs with more than 5M fine-grained annotations provided by Vision-Language Models, Vision-Language-Action models, and real-world robots; (3) train and validate mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that this evaluation can promote the application of Embodied AI under complex real-world distortions. Project page: this https URL
https://arxiv.org/abs/2505.16815
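As a rough illustration of the validation step described in the Embodied-IQA abstract (training and validating mainstream IQA methods against embodied annotations), the snippet below correlates hypothetical IQA predictions with hypothetical embodied-usability labels using Spearman rank correlation; the numbers and the 0-1 scale are made up, and the database's real fields will differ.

```python
# Made-up numbers for illustration: correlate an IQA model's predicted quality
# with embodied-usability annotations via Spearman rank correlation (SRCC),
# a standard way to validate IQA methods against ground-truth scores.
from scipy.stats import spearmanr

predicted = [0.91, 0.42, 0.67, 0.15, 0.78]   # IQA model scores for distorted images
annotated = [0.88, 0.50, 0.61, 0.10, 0.70]   # embodied annotations (VLM/VLA/robot-derived)

srcc, _ = spearmanr(predicted, annotated)
print(f"SRCC between IQA predictions and embodied annotations: {srcc:.3f}")
```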
Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face two challenges: the limited availability of triple-modality data and the difficulty of resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework in which a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text model with the navigation agent. Through lightweight fine-tuning on a small 2D-3D-text corpus, the navigation agent learns to integrate visual cues with spatial-semantic knowledge derived from the 3D-text model, enabling effective reasoning in embodied navigation. CoNav achieves significant improvements on four standard embodied navigation benchmarks (R2R, CVDN, REVERIE, SOON) and two spatial reasoning benchmarks (ScanQA, SQA3D). Moreover, at comparable navigation Success Rates, CoNav often generates shorter paths than other methods (as measured by SPL), showcasing the potential and challenges of fusing data from different modalities in embodied navigation. Project Page: this https URL
https://arxiv.org/abs/2505.16663
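Cross-Modal Belief Alignment in CoNav is described as simply sharing textual hypotheses from the 3D-text model with the navigation agent. The sketch below illustrates that idea at the prompt level; the stubbed 3D-text model, the prompt template, and the function names are assumptions rather than the released implementation.

```python
# Illustrative sketch (not the released CoNav code): the 3D-text model emits
# textual spatial hypotheses, which are simply appended to the image-text
# navigation agent's prompt. Both models are stubbed out here.
from typing import List

def spatial_hypotheses_from_3d(point_cloud_summary: str) -> List[str]:
    # Stand-in for the pretrained 3D-text model: returns structured
    # spatial-semantic statements about the scene.
    return [
        "The sofa is to the left of the doorway.",
        "The staircase is roughly 3 m ahead.",
    ]

def build_navigation_prompt(instruction: str, visual_caption: str,
                            hypotheses: List[str]) -> str:
    belief_block = "\n".join(f"- {h}" for h in hypotheses)
    return (
        f"Instruction: {instruction}\n"
        f"Current view: {visual_caption}\n"
        f"3D spatial beliefs:\n{belief_block}\n"
        f"Next action:"
    )

print(build_navigation_prompt(
    "Go upstairs and stop by the sofa",
    "a living room with a doorway on the right",
    spatial_hypotheses_from_3d("<point-cloud features>"),
))
```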
Vision-Language-Action (VLA) models have advanced robotic control by enabling end-to-end decision-making directly from multimodal inputs. However, their tightly coupled architectures expose novel security vulnerabilities. Unlike traditional adversarial perturbations, backdoor attacks represent a stealthier, persistent, and practically significant threat, particularly under the emerging Training-as-a-Service paradigm, but they remain largely unexplored in the context of VLA models. To address this gap, we propose BadVLA, a backdoor attack method based on Objective-Decoupled Optimization, which for the first time exposes the backdoor vulnerabilities of VLA models. Specifically, it consists of a two-stage process: (1) explicit feature-space separation to isolate trigger representations from benign inputs, and (2) conditional control deviations that activate only in the presence of the trigger, while preserving clean-task performance. Empirical results on multiple VLA benchmarks demonstrate that BadVLA consistently achieves near-100% attack success rates with minimal impact on clean-task accuracy. Further analyses confirm its robustness against common input perturbations, task transfers, and model fine-tuning, underscoring critical security vulnerabilities in current VLA deployments. Our work offers the first systematic investigation of backdoor vulnerabilities in VLA models, highlighting an urgent need for secure and trustworthy embodied model design practices. We have released the project page at this https URL.
https://arxiv.org/abs/2505.16640
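The BadVLA abstract names a two-stage, objective-decoupled process: feature-space separation followed by conditional control deviations that spare clean-task performance. The PyTorch fragment below sketches one plausible reading of those two objectives with toy tensors; the margin loss, MSE terms, and shapes are assumptions, not the authors' code.

```python
# Conceptual sketch with toy tensors; not the authors' implementation.
import torch
import torch.nn.functional as F

def feature_separation_loss(benign_feats, trigger_feats, margin=1.0):
    """Stage 1: push trigger representations away from benign ones in feature space."""
    dist = F.pairwise_distance(benign_feats, trigger_feats)
    return F.relu(margin - dist).mean()

def conditional_deviation_loss(pred_clean, target_clean, pred_trig, target_backdoor):
    """Stage 2: preserve clean-task behavior while steering triggered inputs
    toward the attacker-chosen control deviation."""
    clean_loss = F.mse_loss(pred_clean, target_clean)       # keep benign performance
    backdoor_loss = F.mse_loss(pred_trig, target_backdoor)  # fire only under the trigger
    return clean_loss + backdoor_loss

# toy shapes: 8 feature vectors of dim 128, 7-DoF action predictions
benign, triggered = torch.randn(8, 128), torch.randn(8, 128)
print(feature_separation_loss(benign, triggered).item())
print(conditional_deviation_loss(torch.randn(8, 7), torch.randn(8, 7),
                                 torch.randn(8, 7), torch.zeros(8, 7)).item())
```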
Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions.
https://arxiv.org/abs/2505.16517
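ManipLVM-R1's RLVR setup relies on two rule-based, verifiable rewards: an Affordance Perception Reward for interaction-region localization and a Trajectory Match Reward for physically plausible paths. The snippet below sketches plausible stand-ins (IoU for the former, exponential decay in waypoint error for the latter); the exact formulas and scales used in the paper may differ.

```python
# Hedged sketch of the two rule-based rewards named in the abstract; plausible
# stand-ins for RLVR-style verifiable rewards, not the paper's exact rules.
import math

def affordance_reward(pred_box, gt_box):
    """IoU between the predicted interaction region and the annotated one."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

def trajectory_match_reward(pred_pts, ref_pts, scale=0.1):
    """Decay with the mean point-wise distance between predicted and reference paths."""
    err = sum(math.dist(p, r) for p, r in zip(pred_pts, ref_pts)) / len(ref_pts)
    return math.exp(-err / scale)

print(affordance_reward((0.2, 0.2, 0.6, 0.6), (0.25, 0.2, 0.65, 0.6)))
print(trajectory_match_reward([(0, 0), (0.5, 0.4)], [(0, 0), (0.5, 0.5)]))
```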
Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess the memory utilization capabilities needed to provide personalized assistance. Our framework consists of a two-stage memory evaluation process that quantifies the impact of memory utilization on task performance. It evaluates agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: this https URL
https://arxiv.org/abs/2505.16348
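One way to read MEMENTO's goal of quantifying the impact of memory utilization is an ablation that scores the same personalized instructions with and without interaction history in context. The sketch below illustrates that comparison with a toy agent; it is not the benchmark's actual two-stage protocol, whose details are in the paper.

```python
# Hedged sketch: measure how much providing interaction history (memory)
# changes task success. The scoring function and episode format are stubs.
def evaluate(agent, episodes, use_memory: bool) -> float:
    successes = 0
    for ep in episodes:
        context = ep["history"] if use_memory else ""
        successes += agent(ep["instruction"], context) == ep["goal"]
    return successes / len(episodes)

def memory_impact(agent, episodes) -> float:
    with_mem = evaluate(agent, episodes, use_memory=True)
    without_mem = evaluate(agent, episodes, use_memory=False)
    return with_mem - without_mem   # positive = memory actually helps

# toy agent that only succeeds when the favorite-cup fact is in context
toy_agent = lambda instr, ctx: "blue mug" if "favorite cup is the blue mug" in ctx else "unknown"
episodes = [{"instruction": "Bring my favorite cup",
             "history": "user: my favorite cup is the blue mug",
             "goal": "blue mug"}]
print(memory_impact(toy_agent, episodes))
```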
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE to Drive-$\pi_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from the mode averaging that affects existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.
https://arxiv.org/abs/2505.16278
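DriveMoE's key architectural move is a pair of routers: one that dynamically selects relevant cameras (Vision MoE) and one that activates a skill-specialized expert (Action MoE). The toy PyTorch router below shows the top-k selection pattern such a design implies; the embedding size, number of cameras, and number of experts are illustrative assumptions, not the paper's configuration.

```python
# Toy sketch of the routing idea (not the released DriveMoE code): one router
# scores the surround cameras and keeps the top-k views, another picks a skill
# expert; both are plain linear layers over a driving-context embedding.
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    def __init__(self, ctx_dim: int, num_choices: int, k: int):
        super().__init__()
        self.score = nn.Linear(ctx_dim, num_choices)
        self.k = k

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        logits = self.score(ctx)                     # (batch, num_choices)
        return logits.topk(self.k, dim=-1).indices   # selected camera / expert ids

ctx = torch.randn(2, 256)                             # assumed context embedding dim
camera_router = TopKRouter(256, num_choices=6, k=3)   # e.g. 6 surround cameras, keep 3
action_router = TopKRouter(256, num_choices=4, k=1)   # e.g. 4 skill experts, pick 1
print(camera_router(ctx), action_router(ctx))
```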
Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (LLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.
https://arxiv.org/abs/2505.15685
When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (\textit{e.g.} speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but are semantically shallow. To address this gap, we introduce \textbf{Intentional-Gesture}, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the \textbf{InG} dataset by augmenting BEAT-2 with gesture-intention annotations (\textit{i.e.}, text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the \textbf{Intentional Gesture Motion Tokenizer} to leverage these intention annotations. It injects high-level communicative functions (\textit{e.g.}, intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: this https URL
https://arxiv.org/abs/2505.15197
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.
https://arxiv.org/abs/2503.02379
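The core of DIST2Loss, as the abstract describes it, is turning inherent distances among output tokens into discrete categorical targets drawn from an exponential-family form. A minimal sketch of that idea, assuming an absolute-value distance and a temperature-scaled softmax (both assumptions of this sketch, not necessarily the paper's exact construction), is given below.

```python
# Sketch of the core idea as read from the abstract: convert distances between
# the target token and all candidate tokens into a soft categorical target,
# then train with a cross-entropy against that soft distribution.
import torch
import torch.nn.functional as F

def distance_aware_targets(token_values: torch.Tensor, target_value: float,
                           temperature: float = 1.0) -> torch.Tensor:
    """Soft targets proportional to exp(-d(v_i, v*) / T) over the vocabulary."""
    d = (token_values - target_value).abs()
    return F.softmax(-d / temperature, dim=-1)

def dist2loss(logits: torch.Tensor, token_values: torch.Tensor,
              target_value: float) -> torch.Tensor:
    soft = distance_aware_targets(token_values, target_value)
    return -(soft * F.log_softmax(logits, dim=-1)).sum()

# toy example: tokens encode coordinates 0..9, ground truth is 3
values = torch.arange(10, dtype=torch.float32)
logits = torch.randn(10)
print(dist2loss(logits, values, target_value=3.0))
```

As the temperature shrinks, the soft target collapses onto the ground-truth token and the loss reduces to standard cross-entropy; with a finite temperature, tokens near the target are penalized less than distant ones, which is how the distance relationship enters training.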
Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much like a human would, and has demonstrated impressive success in various tasks, spurring advances in related benchmarks. Despite promising progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly distort the original thinking trajectories and fail to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the factors through which IVS affects unconstrained reasoning performance. To tackle the above gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task has a dedicated free-style IVS generation pipeline supporting function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. Besides, we establish an Incremental Prompting Information Injection (IPII) strategy to ablatively explore the prompting factors for VI-CoT. We extensively evaluate 18 advanced MLLMs, revealing key insights into their VI-CoT capability. Our proposed benchmark is publicly available on Huggingface.
https://arxiv.org/abs/2505.14404
We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4x4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees of Freedom (DoF) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
https://arxiv.org/abs/2505.14366
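To clarify what "an RGB image, a natural language description, and a ground-truth 4x4 transformation matrix" might look like per instance, and how the Z-axis distance target can be read from the pose, here is a small illustrative example; the field names and frame convention are assumptions rather than the dataset's published schema.

```python
# Illustrative instance layout for the VPT dataset described above; the field
# names and camera-frame convention are assumed for this sketch.
import numpy as np

instance = {
    "rgb_path": "scene_0001.png",
    "description": "A red cube sits on the table in front of the robot.",
    # homogeneous object pose in the camera frame: rotation block + translation column
    "pose": np.array([
        [1.0, 0.0, 0.0, 0.10],
        [0.0, 1.0, 0.0, -0.05],
        [0.0, 0.0, 1.0, 0.85],   # translation z = distance along the camera axis
        [0.0, 0.0, 0.0, 1.00],
    ]),
}

z_distance = instance["pose"][2, 3]   # supervision target for the Z-axis task
print(f"Z-axis distance: {z_distance:.2f} m")
```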
Artificial General Intelligence (AGI) is often envisioned as inherently embodied. With recent advances in robotics and foundational AI models, we stand at the threshold of a new era, one marked by increasingly generalized embodied AI systems. This paper contributes to the discourse by introducing a systematic taxonomy of Embodied AGI spanning five levels (L1-L5). We review existing research and challenges at the foundational stages (L1-L2) and outline the key components required to achieve higher-level capabilities (L3-L5). Building on these insights and existing technologies, we propose a conceptual framework for an L3+ robotic brain, offering both a technical outlook and a foundation for future exploration.
https://arxiv.org/abs/2505.14235
Omnidirectional images (ODIs), with their 360° field of view, provide unparalleled spatial awareness for immersive applications like augmented reality and embodied AI. However, the capability of existing multi-modal large language models (MLLMs) to comprehend and reason about such panoramic scenes remains underexplored. This paper addresses this gap by introducing OmniVQA, the first dataset and benchmark for omnidirectional visual question answering. Our evaluation of state-of-the-art MLLMs reveals significant limitations in handling omnidirectional visual question answering, highlighting persistent challenges in object localization, feature extraction, and hallucination suppression within panoramic contexts. These results underscore the disconnect between current MLLM capabilities and the demands of omnidirectional visual understanding, which calls for dedicated architectural or training innovations tailored to 360° imagery. Building on the OmniVQA dataset and benchmark, we further introduce a rule-based reinforcement learning method, 360-R1, based on Qwen2.5-VL-Instruct. Concretely, we modify group relative policy optimization (GRPO) by proposing three novel reward functions: (1) reasoning process similarity reward, (2) answer semantic accuracy reward, and (3) structured format compliance reward. Extensive experiments on our OmniVQA demonstrate the superiority of our proposed method in omnidirectional space (+6% improvement).
https://arxiv.org/abs/2505.14197
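360-R1 modifies GRPO with three reward functions: reasoning-process similarity, answer semantic accuracy, and structured-format compliance. The sketch below shows one way such rewards could be combined into a single scalar per rollout; the token-overlap similarity, exact-match accuracy, tag check, and weights are placeholders, not the paper's scoring rules.

```python
# Hedged sketch of combining the three rewards named in the abstract into one
# scalar for GRPO-style optimization; all scoring functions are placeholders.
def reasoning_similarity(pred_cot: str, ref_cot: str) -> float:
    """Crude token-overlap proxy for reasoning-process similarity."""
    a, b = set(pred_cot.lower().split()), set(ref_cot.lower().split())
    return len(a & b) / max(len(a | b), 1)

def answer_accuracy(pred: str, gold: str) -> float:
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def format_compliance(output: str) -> float:
    return 1.0 if "<think>" in output and "<answer>" in output else 0.0

def total_reward(output, pred_cot, ref_cot, pred_ans, gold_ans,
                 w=(0.3, 0.5, 0.2)) -> float:
    return (w[0] * reasoning_similarity(pred_cot, ref_cot)
            + w[1] * answer_accuracy(pred_ans, gold_ans)
            + w[2] * format_compliance(output))

print(total_reward("<think>the door is behind the camera</think><answer>behind</answer>",
                   "the door is behind the camera", "the door lies behind the viewer",
                   "behind", "behind"))
```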
Evolution and learning have historically been interrelated topics, and their interplay is attracting increased interest lately. The emerging new factor in this trend is morphological evolution, the evolution of physical forms within embodied AI systems such as robots. In this study, we investigate a system of hexacopter-type drones with evolvable morphologies and learnable controllers and make contributions to two fields. For aerial robotics, we demonstrate that the combination of evolution and learning can deliver non-conventional drones that significantly outperform the traditional hexacopter on several tasks that are more complex than previously considered in the literature. For the field of Evolutionary Computing, we introduce novel metrics and perform new analyses of the interaction between morphological evolution and learning, uncovering hitherto unidentified effects. Our analysis tools are domain-agnostic, making a methodological contribution towards building solid foundations for embodied AI systems that integrate evolution and learning.
https://arxiv.org/abs/2505.14129
Embodied Question Answering (EQA) requires agents to autonomously explore and understand the environment to answer context-dependent questions. Existing frameworks typically center around the planner, which guides the stopping module, memory module, and answering module for reasoning. In this paper, we propose a memory-centric EQA framework named MemoryEQA. Unlike planner-centric EQA models, where the memory module cannot fully interact with other modules, MemoryEQA flexibly feeds memory information into all modules, thereby enhancing efficiency and accuracy in handling complex tasks, such as those involving multiple targets across different regions. Specifically, we establish a multi-modal hierarchical memory mechanism, which is divided into global memory that stores language-enhanced scene maps, and local memory that retains historical observations and state information. When performing EQA tasks, the multi-modal large language model is leveraged to convert memory information into the required input formats for injection into different modules. To evaluate EQA models' memory capabilities, we constructed the MT-HM3D dataset based on HM3D, comprising 1,587 question-answer pairs involving multiple targets across various regions, which requires agents to maintain memory of exploration-acquired target information. Experimental results on HM-EQA, MT-HM3D, and OpenEQA demonstrate the effectiveness of our framework, where a 19.8% performance gain on MT-HM3D over the baseline model further underscores the pivotal role of memory capability in resolving complex tasks.
https://arxiv.org/abs/2505.13948
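MemoryEQA's memory is described as hierarchical: a global, language-enhanced scene map plus a local store of historical observations and state, serialized so it can be injected into every module. The data-structure sketch below captures that shape under assumed field names and a fixed history window; it is not the framework's actual implementation.

```python
# Sketch of a two-level memory in the spirit of the description above
# (global scene map + local observation history); contents and retrieval
# logic in MemoryEQA itself are not reproduced here.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class HierarchicalMemory:
    # global memory: language-enhanced scene map, keyed by region name
    scene_map: dict = field(default_factory=dict)
    # local memory: recent observations and agent state, bounded window
    history: deque = field(default_factory=lambda: deque(maxlen=50))

    def update(self, region: str, caption: str, observation: str, state: dict):
        self.scene_map.setdefault(region, []).append(caption)
        self.history.append({"obs": observation, "state": state})

    def as_prompt(self) -> str:
        """Serialize memory so a multi-modal LLM can consume it in any module."""
        global_part = "; ".join(f"{r}: {', '.join(c)}" for r, c in self.scene_map.items())
        local_part = " | ".join(h["obs"] for h in list(self.history)[-5:])
        return f"[Scene map] {global_part}\n[Recent] {local_part}"

mem = HierarchicalMemory()
mem.update("kitchen", "a red kettle on the stove", "saw a kettle", {"pos": (1, 2)})
print(mem.as_prompt())
```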
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose \texttt{DialogTool}, a multi-turn dialogue dataset with stateful tool interactions covering the whole life cycle of tool use, across six key tasks in three stages: 1) \textit{tool creation}; 2) \textit{tool utilization}: tool awareness, tool selection, tool execution; and 3) \textit{role-consistent response}: response generation and role play. Furthermore, we build \texttt{VirtualMobile} -- an embodied virtual mobile evaluation environment to simulate API calls and assess the robustness of the created APIs\footnote{We use tools and APIs interchangeably; there are no significant differences between them in this paper.}. Taking advantage of these artifacts, we conduct a comprehensive evaluation of 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still cannot use tools well over long horizons.
https://arxiv.org/abs/2505.13328
Early cavemen relied on gestures, vocalizations, and simple signals to coordinate, plan, avoid predators, and share resources. Today, humans collaborate using complex languages to achieve remarkable results. What drives this evolution in communication? How does language emerge, adapt, and become vital for teamwork? Understanding the origins of language remains a challenge. A leading hypothesis in linguistics and anthropology posits that language evolved to meet the ecological and social demands of early human cooperation. Language did not arise in isolation, but through shared survival goals. Inspired by this view, we investigate the emergence of language in multi-agent Foraging Games. These environments are designed to reflect the cognitive and ecological constraints believed to have influenced the evolution of communication. Agents operate in a shared grid world with only partial knowledge about other agents and the environment, and must coordinate to complete games like picking up high-value targets or executing temporally ordered actions. Using end-to-end deep reinforcement learning, agents learn both actions and communication strategies from scratch. We find that agents develop communication protocols with hallmark features of natural language: arbitrariness, interchangeability, displacement, cultural transmission, and compositionality. We quantify each property and analyze how different factors, such as population size and temporal dependencies, shape specific aspects of the emergent language. Our framework serves as a platform for studying how language can evolve from partial observability, temporal reasoning, and cooperative goals in embodied multi-agent settings. We will release all data, code, and models publicly.
https://arxiv.org/abs/2505.12872
Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines, ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces, there is still no unified benchmark that enables fair, reproducible comparison of these systems. In this paper, we introduce TeleOpBench, a simulator-centric benchmark tailored to bimanual dexterous teleoperation. TeleOpBench contains 30 high-fidelity task environments that span pick-and-place, tool use, and collaborative manipulation, covering a broad spectrum of kinematic and force-interaction difficulty. Within this benchmark we implement four representative teleoperation modalities, (i) MoCap, (ii) VR devices, (iii) arm-hand exoskeletons, and (iv) monocular vision tracking, and evaluate them with a common protocol and metric suite. To validate that performance in simulation is predictive of real-world behavior, we conduct mirrored experiments on a physical dual-arm platform equipped with two 6-DoF dexterous hands. Across 10 held-out tasks we observe a strong correlation between simulator and hardware performance, confirming the external validity of TeleOpBench. TeleOpBench establishes a common yardstick for teleoperation research and provides an extensible platform for future algorithmic and hardware innovation.
https://arxiv.org/abs/2505.12748
Advances in deep generative modelling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants.\footnote{We have done a privacy review for the public release of an initial 200-hour subset of the dataset, with plans to release most of the dataset over time.} Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.
https://arxiv.org/abs/2505.12707
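PLAICraft logs five modalities (video, game output audio, microphone input audio, mouse, and keyboard) with millisecond alignment. A minimal sketch of what one such time-stamped record could look like is below; the JSON layout and stream names are assumptions for illustration, since the dataset's real storage format is not specified in the abstract.

```python
# Illustrative record layout for millisecond-aligned multi-modal logging;
# the actual PLAICraft storage format will likely differ.
import json, time

def log_event(stream: str, payload: dict) -> str:
    record = {
        "t_ms": int(time.time() * 1000),  # millisecond timestamp shared by all streams
        "stream": stream,                 # one of: video, game_audio, mic_audio, mouse, keyboard
        "payload": payload,
    }
    return json.dumps(record)

print(log_event("keyboard", {"key": "W", "state": "down"}))
print(log_event("mouse", {"dx": 4, "dy": -2, "buttons": []}))
```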