Spatial tracing, a fundamental embodied interaction ability for robots, is inherently challenging: it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement, and existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that is the first to achieve both 3D spatial referring and measuring, via a universal spatial encoder and a regression-supervised decoder that enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs spanning outdoor, indoor, and tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
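The abstract describes the metric-sensitive process rewards only at a high level; as a rough illustration, one plausible instantiation scores each intermediate metric prediction (e.g. a distance in meters) against a reference. The function name, per-step structure, and relative tolerance below are assumptions, not the paper's formulation:

```python
def process_reward(predicted_steps, reference_steps, rel_tol=0.1):
    """Fraction of intermediate metric predictions whose relative error
    against the reference falls within rel_tol (hypothetical sketch)."""
    if not reference_steps:
        return 0.0
    hits = 0
    for pred, ref in zip(predicted_steps, reference_steps):
        if ref != 0 and abs(pred - ref) / abs(ref) <= rel_tol:
            hits += 1
    return hits / len(reference_steps)
```

A dense per-step signal like this (rather than a single final-answer reward) is what lets RFT supervise the reasoning chain itself.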
https://arxiv.org/abs/2512.13660
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
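The do-then-undo requirement amounts to a cycle-consistency check on the generated images; a minimal sketch of such a metric, with hypothetical `do`/`undo` callables standing in for the model's forward and reverse edits:

```python
def cycle_error(image, do, undo):
    """Mean absolute pixel difference between an image and its do->undo
    round trip. `image` is a nested list of pixel values; `do`/`undo`
    are callables (hypothetical stand-ins for the generative model)."""
    restored = undo(do(image))
    flat = [p for row in image for p in row]
    flat_r = [p for row in restored for p in row]
    return sum(abs(a - b) for a, b in zip(flat, flat_r)) / len(flat)
```

A perfectly reversible action pair drives this error to zero, which is the consistency the training strategy enforces.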
https://arxiv.org/abs/2512.13609
Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.
https://arxiv.org/abs/2512.13250
While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (i.e., understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (i.e., world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction models). Motus further leverages optical flow to learn latent actions and adopts a recipe with a three-phase training pipeline and a six-layer data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining. Experiments show that Motus outperforms state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi0.5) and real-world scenarios (improvements of +11-48%), demonstrating that unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.
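How a single UniDiffuser-style scheduler can switch among the listed modeling modes can be sketched by assigning per-modality diffusion timesteps: conditioned modalities stay clean (t=0) while generated ones are noised. The mode names follow the abstract, but the modality assignments and representation are illustrative assumptions, not Motus's actual configuration:

```python
# Hypothetical mode table: which modalities are conditioned on vs. generated.
MODES = {
    "world_model":        {"condition": ["obs", "action"], "generate": ["video"]},
    "vla":                {"condition": ["obs", "text"],   "generate": ["action"]},
    "inverse_dynamics":   {"condition": ["obs", "video"],  "generate": ["action"]},
    "video_generation":   {"condition": ["obs", "text"],   "generate": ["video"]},
    "video_action_joint": {"condition": ["obs", "text"],   "generate": ["video", "action"]},
}

def timestep_assignment(mode, t):
    """Per-modality diffusion timesteps for one training/sampling step:
    conditioned modalities stay clean (t=0); generated ones share t."""
    spec = MODES[mode]
    out = {m: 0 for m in spec["condition"]}
    out.update({m: t for m in spec["generate"]})
    return out
```

One network then serves every mode; only the timestep assignment changes.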
https://arxiv.org/abs/2512.13030
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) a Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) a Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive, partially annotated hybrid data, allowing different CoT components to mutually reinforce and implicitly supervise each other. To support this, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that is compatible with online learning methods such as RL and DAgger. D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate its effectiveness.
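The SLFS idea of learning from partially annotated data can be illustrated with a masked loss that averages only over positions whose annotations exist, so samples missing some CoT components still contribute. This plain-Python sketch is an assumption about the general mechanism, not the paper's exact loss:

```python
import math

def masked_loss(log_probs, targets, mask):
    """Average negative log-likelihood over annotated positions only.
    log_probs: list of dicts mapping token -> log-probability;
    targets: list of target tokens;
    mask: 1 where the CoT component is annotated, 0 where it is missing."""
    total, count = 0.0, 0
    for lp, tgt, m in zip(log_probs, targets, mask):
        if m:
            total += -lp[tgt]
            count += 1
    return total / max(count, 1)
```

Unannotated positions receive no gradient, but the model's own predictions there still condition later tokens, which is how the components implicitly supervise each other.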
https://arxiv.org/abs/2512.12622
Generating articulated assets is crucial for robotics, digital twins, and embodied intelligence. Existing generative models often rely on single-view inputs representing closed states, resulting in ambiguous or unrealistic kinematic structures due to the entanglement between geometric shape and joint dynamics. To address these challenges, we introduce ArtGen, a conditional diffusion-based framework capable of generating articulated 3D objects with accurate geometry and coherent kinematics from single-view images or text descriptions at arbitrary part-level states. Specifically, ArtGen employs cross-state Monte Carlo sampling to explicitly enforce global kinematic consistency, reducing structural-motion entanglement. Additionally, we integrate a Chain-of-Thought reasoning module to infer robust structural priors, such as part semantics, joint types, and connectivity, guiding a sparse-expert Diffusion Transformer to specialize in diverse kinematic interactions. Furthermore, a compositional 3D-VAE latent prior enhanced with local-global attention effectively captures fine-grained geometry and global part-level relationships. Extensive experiments on the PartNet-Mobility benchmark demonstrate that ArtGen significantly outperforms state-of-the-art methods.
https://arxiv.org/abs/2512.12395
Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide complementary cues for relative camera pose estimation in in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this is the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: this http URL.
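For intuition, a direction-of-arrival cue can be recovered from the inter-channel delay of a two-microphone recording; the paper uses full DOA spectra, so this far-field two-channel version is only a simplified sketch (the mic spacing `d` and sound speed `c` are assumed values):

```python
import math

def estimate_doa(left, right, fs, d=0.2, c=343.0):
    """Azimuth (radians) of a far-field source from two channels:
    find the inter-channel lag maximizing the cross-correlation, convert
    the lag to a delay, then to an angle via the far-field geometry."""
    best_lag, best_score = 0, float("-inf")
    max_lag = len(left) - 1
    for lag in range(-max_lag, max_lag + 1):
        score = sum(left[i] * right[i - lag]
                    for i in range(max(lag, 0), min(len(left), len(right) + lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    delay = best_lag / fs  # seconds
    return math.asin(max(-1.0, min(1.0, delay * c / d)))
```

A DOA spectrum generalizes this by scoring every candidate angle rather than committing to the single best lag.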
https://arxiv.org/abs/2512.12165
Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to embodied agents operating in real-world environments. To address the communication constraints of embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low-bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for embodied agents, thereby accelerating embodied AI deployment in the real world.
https://arxiv.org/abs/2512.11612
In embodied intelligence, the embodiment gap between robotic and human hands brings significant challenges for learning from human demonstrations. Although some studies have attempted to bridge this gap using reinforcement learning, they remain confined to merely reproducing human manipulation, resulting in limited task performance. In this paper, we propose UniBYD, a unified framework that uses a dynamic reinforcement learning algorithm to discover manipulation policies aligned with the robot's physical characteristics. To enable consistent modeling across diverse robotic hand morphologies, UniBYD incorporates a unified morphological representation (UMR). Building on UMR, we design a dynamic PPO with an annealed reward schedule, enabling reinforcement learning to transition from imitating human demonstrations to exploring policies better adapted to diverse robotic morphologies, thereby going beyond mere imitation of human hands. To address the frequent failures of learning human priors in the early training stage, we design a hybrid Markov-based shadow engine that enables reinforcement learning to imitate human manipulation in a fine-grained manner. To evaluate UniBYD comprehensively, we propose UniManip, the first benchmark encompassing robotic manipulation tasks spanning multiple hand morphologies. Experiments demonstrate a 67.90% improvement in success rate over the current state-of-the-art. Upon acceptance of the paper, we will release our code and benchmark at this https URL.
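The annealed reward schedule is not specified in detail; a minimal sketch under the assumption of linear annealing between an imitation reward and a task reward (both names and the linear curve are hypothetical):

```python
def annealed_reward(r_imitate, r_task, step, anneal_steps):
    """Linearly shift reward weight from pure imitation to pure task reward
    over anneal_steps training steps (illustrative schedule)."""
    alpha = max(0.0, 1.0 - step / anneal_steps)  # 1 -> 0 over anneal_steps
    return alpha * r_imitate + (1.0 - alpha) * r_task
```

Early training thus tracks the human demonstration closely, while late training lets PPO optimize the task reward with morphology-specific strategies.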
https://arxiv.org/abs/2512.11609
Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. This field is exploding with new models and datasets, making it both exciting and challenging to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We design it to follow the natural learning path of a researcher: we start with the basic Modules of any VLA model, trace the history through key Milestones, and then dive deep into the core Challenges that define the recent research frontier. Our main contribution is a detailed breakdown of the five biggest challenges: (1) Representation, (2) Execution, (3) Generalization, (4) Safety, and (5) Dataset and Evaluation. This structure mirrors the developmental roadmap of a generalist agent: establishing the fundamental perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment, all supported by the essential data infrastructure. For each challenge, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, with the dual aim of accelerating learning and inspiring new ideas in embodied intelligence. A live version of this survey, with continuous updates, is maintained on our project page (this https URL).
https://arxiv.org/abs/2512.11362
Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.
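The IDSL itself is not shown in the abstract; as an illustration of the shared-representation idea, here is a toy scene entry that either a text parser or a floor-plan parser could emit, plus a validation check. The schema and field names are hypothetical, not the authors' actual IDSL:

```python
# Fields every IDSL object entry is assumed to carry in this toy schema.
REQUIRED = {"category", "position", "interactions"}

def validate_scene(scene):
    """Check that every object in an IDSL scene dict has the required fields."""
    return all(REQUIRED <= set(obj) for obj in scene.get("objects", []))

scene = {
    "room": {"type": "living_room", "size_m": [5.0, 4.0]},
    "objects": [
        {"category": "cabinet", "position": [1.0, 0.5],
         "interactions": ["open_door"]},
    ],
}
```

Because interactions are first-class fields rather than an afterthought, a downstream renderer can attach behaviors (e.g. an openable door) instead of producing a functionally inert layout.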
https://arxiv.org/abs/2512.11234
Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects (Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference), jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity, standardizing how future models are judged not only by how real they look, but by how real they behave.
https://arxiv.org/abs/2512.10958
A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object's surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object's center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.
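The abstract notes that volumetric density is fundamental for predicting an object's center of mass; that computation on a voxelized density field is straightforward. A minimal sketch (nested list `rho[z][y][x]`; a uniform voxel size is an assumption here):

```python
def center_of_mass(rho, voxel=1.0):
    """Density-weighted mean voxel coordinate, returned as (x, y, z)."""
    total, mx, my, mz = 0.0, 0.0, 0.0, 0.0
    for z, plane in enumerate(rho):
        for y, row in enumerate(plane):
            for x, d in enumerate(row):
                total += d
                mx += d * x
                my += d * y
                mz += d * z
    if total == 0:
        raise ValueError("empty density field")
    return (mx / total * voxel, my / total * voxel, mz / total * voxel)
```

A surface-only model implicitly assumes uniform density and can place this point badly for hollow or weighted objects, which is exactly the failure XDen-1K targets.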
https://arxiv.org/abs/2512.10668
We propose LEO-RobotAgent, a general-purpose language-driven intelligent agent framework for robots. Under this framework, LLMs can operate different types of robots to complete unpredictable complex tasks across various scenarios. The framework features strong generalization, robustness, and efficiency, and the application-level system built around it enhances bidirectional human-robot intent understanding and lowers the barrier to human-robot interaction. Regarding robot task planning, the vast majority of existing studies focus on the application of large models in single-task scenarios and for single robot types; these algorithms often have complex structures and lack generalizability. The proposed LEO-RobotAgent framework is therefore designed with as streamlined a structure as possible, enabling large models to independently think, plan, and act within a clear framework. We provide a modular and easily registrable toolset, allowing large models to flexibly call various tools to meet different requirements. Meanwhile, the framework incorporates a human-robot interaction mechanism, enabling the algorithm to collaborate with humans like a partner. Experiments verify that the framework can be easily adapted to mainstream robot platforms, including unmanned aerial vehicles (UAVs), robotic arms, and wheeled robots, and can efficiently execute a variety of carefully designed tasks of different complexity levels. Our code is available at this https URL.
https://arxiv.org/abs/2512.10605
Current embodied AI systems face severe engineering impediments, primarily characterized by poor cross-scenario adaptability, rigid inter-module coupling, and fragmented inference acceleration. To overcome these limitations, we propose RoboNeuron, a universal deployment framework for embodied intelligence. RoboNeuron is the first framework to deeply integrate the cognitive capabilities of Large Language Models (LLMs) and Vision-Language-Action (VLA) models with the real-time execution backbone of the Robot Operating System (ROS). We utilize the Model Context Protocol (MCP) as a semantic bridge, enabling the LLM to dynamically orchestrate underlying robotic tools. The framework establishes a highly modular architecture that strictly decouples sensing, reasoning, and control by leveraging ROS's unified communication interfaces. Crucially, we introduce an automated tool to translate ROS messages into callable MCP functions, significantly streamlining development. RoboNeuron significantly enhances cross-scenario adaptability and component flexibility, while establishing a systematic platform for horizontal performance benchmarking, laying a robust foundation for scalable real-world embodied applications.
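The ROS-to-MCP translation is described only at a high level; the registry pattern below sketches the core idea of turning a message definition into a callable, validated tool. No real ROS or MCP API is used, and all names (`TOOL_REGISTRY`, `cmd_vel`, the field list) are hypothetical:

```python
TOOL_REGISTRY = {}

def register_message(name, fields):
    """Wrap a ROS-style message definition as a callable tool: validate
    keyword args against the declared fields, return a dict payload, and
    store the tool under its name for an LLM orchestrator to invoke."""
    def tool(**kwargs):
        missing = [f for f in fields if f not in kwargs]
        if missing:
            raise TypeError(f"{name} missing fields: {missing}")
        return {"tool": name, "payload": {f: kwargs[f] for f in fields}}
    TOOL_REGISTRY[name] = tool
    return tool

# Hypothetical example: a velocity-command message becomes a tool.
move = register_message("cmd_vel", ["linear_x", "angular_z"])
```

In an actual deployment the payload would be published on the corresponding ROS topic and the registry exposed to the LLM through MCP; this sketch only shows the decoupling the framework relies on.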
https://arxiv.org/abs/2512.10394
The 2025 BEHAVIOR Challenge is designed to rigorously track progress toward solving long-horizon tasks by physical agents in simulated environments. BEHAVIOR-1K focuses on the everyday household tasks people most want robots to assist with; these tasks introduce long-horizon mobile manipulation challenges in realistic settings, bridging the gap between current research and real-world, human-centric applications. This report presents our solution, which placed a very close 2nd in the 2025 BEHAVIOR Challenge and substantially outperformed the rest of the submissions. Building on $\pi_{0.5}$, we focus on systematically building our solution by studying the effects of training techniques and data. Through careful ablations, we show the scaling power of the pre-training and post-training phases for competitive performance. We summarize practical lessons and design recommendations that we hope will provide actionable insights for the broader embodied AI community when adapting powerful foundation models to complex embodied scenarios.
https://arxiv.org/abs/2512.10071
Recent advances in foundation models have shown promising results in developing generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) grounding multimodal instructions, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation among people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.
https://arxiv.org/abs/2512.10046
Long-term planning in complex, text-based environments presents significant challenges due to open-ended action spaces, ambiguous observations, and sparse feedback. Recent research suggests that large language models (LLMs) encode rich semantic knowledge about the world, which can be valuable for guiding agents in high-level reasoning and planning across both embodied and purely textual settings. However, existing approaches often depend heavily on querying LLMs during training and inference, making them computationally expensive and difficult to deploy efficiently. In addition, these methods typically employ a pretrained, unaltered LLM whose parameters remain fixed throughout training, providing no opportunity for adaptation to the target task. To address these limitations, we introduce SCOPE (Subgoal-COnditioned Pretraining for Efficient planning), a one-shot hierarchical planner that leverages LLM-generated subgoals only at initialization to pretrain a lightweight student model. Unlike prior approaches that distill LLM knowledge by repeatedly prompting the model to adaptively generate subgoals during training, our method derives subgoals directly from example trajectories. This design removes the need for repeated LLM queries, significantly improving efficiency, though at the cost of reduced explainability and potentially suboptimal subgoals. Despite their suboptimality, our results on the TextCraft environment show that LLM-generated subgoals can still serve as a strong starting point for hierarchical goal decomposition in text-based planning tasks. Compared to the LLM-based hierarchical agent ADaPT (Prasad et al., 2024), which achieves a 0.52 success rate, our method reaches 0.56 and reduces inference time from 164.4 seconds to just 3.0 seconds.
https://arxiv.org/abs/2512.09897
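The one-shot recipe in SCOPE (subgoals derived once from example trajectories, then used to pretrain a lightweight student) can be illustrated with a minimal sketch. All names here (`derive_subgoals`, `TabularStudent`, the segment-final-observation heuristic standing in for LLM-generated subgoals) are hypothetical, not the paper's actual implementation:

```python
from collections import Counter, defaultdict

def derive_subgoals(trajectory, segment_len=2):
    """Split an example trajectory of (observation, action) pairs into
    fixed-length segments and label each segment with its final observation,
    standing in for a subgoal the LLM would produce once, at initialization."""
    subgoals = []
    for i in range(0, len(trajectory), segment_len):
        segment = trajectory[i:i + segment_len]
        subgoal = segment[-1][0]  # hypothetical heuristic: segment's last observation
        subgoals.append((subgoal, segment))
    return subgoals

class TabularStudent:
    """Lightweight student policy: conditioned on (observation, subgoal),
    it predicts the most frequently demonstrated action."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def pretrain(self, subgoal_segments):
        # Behavioral cloning by counting: no LLM queries after initialization.
        for subgoal, segment in subgoal_segments:
            for obs, action in segment:
                self.counts[(obs, subgoal)][action] += 1

    def act(self, obs, subgoal):
        table = self.counts.get((obs, subgoal))
        return table.most_common(1)[0][0] if table else None

# Toy TextCraft-style trajectory: each step is (observation, action).
traj = [("start", "craft stick"), ("has stick", "craft plank"),
        ("has plank", "craft table"), ("has table", "done")]
student = TabularStudent()
student.pretrain(derive_subgoals(traj))
student.act("start", "has stick")  # → "craft stick"
```

Because the student is queried instead of the LLM at inference time, per-episode cost drops to a table lookup, which is the efficiency trade the abstract describes (at the price of possibly suboptimal subgoals).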
Navigating complex urban environments using natural language instructions poses significant challenges for embodied agents, including noisy language instructions, ambiguous spatial references, diverse landmarks, and dynamic street scenes. Current visual navigation methods are typically limited to simulated or off-street environments, and often rely on precise goal formats, such as specific coordinates or images. This limits their effectiveness for autonomous agents like last-mile delivery robots navigating unfamiliar cities. To address these limitations, we introduce UrbanNav, a scalable framework that trains embodied agents to follow free-form language instructions in diverse urban settings. Leveraging web-scale city walking videos, we develop a scalable annotation pipeline that aligns human navigation trajectories with language instructions grounded in real-world landmarks. UrbanNav encompasses over 1,500 hours of navigation data and 3 million instruction-trajectory-landmark triplets, capturing a wide range of urban scenarios. Our model learns robust navigation policies to tackle complex urban scenarios, demonstrating superior spatial reasoning, robustness to noisy instructions, and generalization to unseen urban settings. Experimental results show that UrbanNav significantly outperforms existing methods, highlighting the potential of large-scale web video data to enable language-guided, real-world urban navigation for embodied agents.
https://arxiv.org/abs/2512.09607
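The instruction-trajectory-landmark triplets at the core of UrbanNav's annotation pipeline can be sketched as follows. This is a minimal illustration under assumed data shapes (timestamped instructions, timestamped poses); the `NavTriplet` type, the `align` function, and the segment-by-next-instruction rule are all hypothetical, not the paper's pipeline:

```python
from dataclasses import dataclass

@dataclass
class NavTriplet:
    instruction: str   # free-form language instruction
    trajectory: list   # sequence of (timestamp, x, y) poses
    landmark: str      # real-world landmark grounding the instruction

def align(instructions, poses):
    """Pair each timestamped instruction with the trajectory segment it covers:
    all poses from its start time up to the next instruction's start time.

    instructions: list of (start_time, text, landmark), sorted by start_time
    poses:        list of (timestamp, x, y), sorted by timestamp
    """
    triplets = []
    for i, (start, text, landmark) in enumerate(instructions):
        end = instructions[i + 1][0] if i + 1 < len(instructions) else float("inf")
        segment = [p for p in poses if start <= p[0] < end]
        triplets.append(NavTriplet(text, segment, landmark))
    return triplets

instructions = [(0.0, "walk toward the cafe", "cafe"),
                (10.0, "turn left at the fountain", "fountain")]
poses = [(0.0, 0, 0), (5.0, 1, 0), (10.0, 2, 0), (15.0, 2, 1)]
triplets = align(instructions, poses)
```

Each resulting triplet couples one instruction, its landmark, and the slice of the human walking trajectory it narrates, which is the unit a navigation policy would be trained on.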
This chapter argues that the reliability of agentic and generative AI is chiefly an architectural property. We define agentic systems as goal-directed, tool-using decision makers operating in closed loops, and show how reliability emerges from principled componentisation (goal manager, planner, tool-router, executor, memory, verifiers, safety monitor, telemetry), disciplined interfaces (schema-constrained, validated, least-privilege tool calls), and explicit control and assurance loops. Building on classical foundations, we propose a practical taxonomy (tool-using agents, memory-augmented agents, planning and self-improvement agents, multi-agent systems, and embodied or web agents) and analyse how each pattern reshapes the reliability envelope and failure modes. We distil design guidance on typed schemas, idempotency, permissioning, transactional semantics, memory provenance and hygiene, runtime governance (budgets, termination conditions), and simulate-before-actuate safeguards.
https://arxiv.org/abs/2512.09458
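Several of the design disciplines the chapter names (schema-constrained tool calls, least privilege, idempotency declarations, and budget-based termination) can be sketched together in a small example. The registry contents, function names, and budget mechanism below are illustrative assumptions, not the chapter's reference implementation:

```python
TOOL_SCHEMAS = {
    # Hypothetical tool registry: each permitted tool declares its typed
    # parameters and whether repeating the call is safe (idempotency).
    "search_docs": {"params": {"query": str, "limit": int}, "idempotent": True},
}

def validate_call(tool, args):
    """Schema-constrained interface: reject unknown tools (least privilege),
    unknown fields, and missing or mistyped parameters before anything runs."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        raise PermissionError(f"tool not permitted: {tool}")
    params = schema["params"]
    unknown = set(args) - set(params)
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    for name, typ in params.items():
        # args.get() returns None for missing fields, so this also
        # rejects calls that omit a required parameter.
        if not isinstance(args.get(name), typ):
            raise TypeError(f"{name} must be {typ.__name__}")
    return True

class BudgetedExecutor:
    """Runtime governance: a hard call budget doubles as a termination
    condition for the agent's closed loop."""
    def __init__(self, budget):
        self.budget = budget

    def execute(self, tool, args, run):
        if self.budget <= 0:
            raise RuntimeError("budget exhausted: terminating agent loop")
        validate_call(tool, args)
        self.budget -= 1
        return run(tool, args)
```

The point of routing every call through `validate_call` and `BudgetedExecutor` is that reliability constraints live in the interfaces, not in the model's good behaviour, which is the chapter's architectural thesis in miniature.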