VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.
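A minimal sketch of the adaptive chain-of-thought gate described above, assuming a hypothetical `policy` object with `confidence`, `act_fast`, and `reason_then_act` methods and a list-like `memory`; none of these names come from the paper.

```python
def adaptive_cot_step(policy, obs, memory, conf_threshold=0.8):
    """Trigger explicit reasoning only when the fast policy is uncertain."""
    conf = policy.confidence(obs)                    # System 1: fast, intuitive estimate
    if conf >= conf_threshold:
        return policy.act_fast(obs)                  # no chain of thought emitted
    thought = policy.reason_then_act(obs, memory)    # System 2: slow, deliberate planning
    memory.append(thought.summary)                   # persist the linguistic trace for recall
    return thought.action
```

The fixed threshold here stands in for the learned "when to think" decision that the adaptive CoT annotations are meant to induce.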
https://arxiv.org/abs/2601.08665
The advent of Large Multimodal Models (LMMs) offers a promising technology to tackle the limitations of modular design in autonomous driving, which often falters in open-world scenarios requiring sustained environmental understanding and logical reasoning. In addition, embodied artificial intelligence facilitates policy optimization through closed-loop interactions to achieve continuous learning, thereby advancing autonomous driving toward embodied intelligence (EI) driving. However, this capability is constrained when EI driving relies solely on LMMs without joint decision-making. This article introduces a novel semantics and policy dual-driven hybrid decision framework to tackle this challenge, ensuring both continuous learning and joint decision-making. The framework merges LMMs for semantic understanding and cognitive representation with deep reinforcement learning (DRL) for real-time policy optimization. We start by introducing the foundational principles of EI driving and LMMs. Moreover, we examine the emerging opportunities this framework enables, encompassing potential benefits and representative use cases. A case study is conducted experimentally to validate the performance superiority of our framework in completing a lane-change planning task. Finally, several future research directions to empower EI driving are identified to guide subsequent work.
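As a rough illustration of the dual-driven idea, the sketch below fuses an LMM's semantic recommendation with a DRL control action; `lmm_recommend` and `drl_policy` are placeholder callables, and the veto rule is an assumption rather than the paper's arbitration scheme.

```python
def fuse_decision(obs, lmm_recommend, drl_policy, veto_risks=("high",)):
    semantic = lmm_recommend(obs)    # e.g. {"maneuver": "change_left", "risk": "low"}
    control = drl_policy(obs)        # e.g. a steering / acceleration vector
    if semantic.get("risk") in veto_risks:
        return {"maneuver": "keep_lane", "control": None, "reason": "semantic veto"}
    return {"maneuver": semantic.get("maneuver"), "control": control, "reason": "joint decision"}
```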
https://arxiv.org/abs/2601.08434
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
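Toy versions of two of these language-level misalignment measures (hallucination and critical omission) can be sketched as below; the paper's exact definitions may differ, so this only illustrates the counting logic.

```python
def misalignment_metrics(mentioned, present, safety_critical):
    mentioned, present = set(mentioned), set(present)
    hallucinated = mentioned - present                    # mentioned but not in the scene
    critical_present = present & set(safety_critical)
    omitted_critical = critical_present - mentioned       # safety-critical, present, unmentioned
    return {
        "hallucination_rate": len(hallucinated) / max(len(mentioned), 1),
        "critical_omission_rate": len(omitted_critical) / max(len(critical_present), 1),
    }

print(misalignment_metrics(mentioned=["car", "dog"],
                           present=["car", "pedestrian"],
                           safety_critical=["pedestrian"]))
# {'hallucination_rate': 0.5, 'critical_omission_rate': 1.0}
```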
https://arxiv.org/abs/2601.08355
Animal behavior reflects interactions between the nervous system, body, and environment. Therefore, biomechanics and environmental context must be considered to dissect algorithms for behavioral control. This is enabled by leveraging neuromechanical digital twins: computational models that embed artificial neural controllers within realistic body models in simulated environments. Here we review advances in the creation and use of neuromechanical digital twins while also highlighting emerging opportunities for the future. First, we illustrate how neuromechanical models allow researchers to infer hidden biophysical variables that may be difficult to measure experimentally. Additionally, by perturbing these models, one can generate new experimentally testable hypotheses. Next, we explore how neuromechanical twins have been used to foster a deeper exchange between neuroscience, robotics, and machine learning. Finally, we show how neuromechanical twins can advance healthcare. We envision that coupling studies on animals with active probing of their neuromechanical twins will greatly accelerate neuroscientific discovery.
https://arxiv.org/abs/2601.08056
Video generation models have emerged as high-fidelity models of the physical world, capable of synthesizing high-quality videos capturing fine-grained interactions between agents and their environments conditioned on multi-modal user inputs. Their impressive capabilities address many of the long-standing challenges faced by physics-based simulators, driving broad adoption in many problem domains, e.g., robotics. For example, video models enable photorealistic, physically consistent deformable-body simulation without making prohibitive simplifying assumptions, which is a major bottleneck in physics-based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine-grained and expressive way. They thus overcome the limited expressiveness of language-only abstractions in describing intricate physical interactions. In this survey, we provide a review of video models and their applications as embodied world models in robotics, encompassing cost-effective data generation and action prediction in imitation learning, dynamics and rewards modeling in reinforcement learning, visual planning, and policy evaluation. Further, we highlight important challenges hindering the trustworthy integration of video models in robotics, which include poor instruction following, hallucinations such as violations of physics, and unsafe content generation, in addition to fundamental limitations such as significant data curation, training, and inference costs. We present potential future directions to address these open research challenges to motivate research and ultimately facilitate broader applications, especially in safety-critical settings.
https://arxiv.org/abs/2601.07823
Spatial intelligence refers to the ability to perceive, reason about, and describe objects and their relationships within three-dimensional environments, forming a foundation for embodied perception and scene understanding. 3D captioning aims to describe 3D scenes in natural language; however, it remains challenging due to the sparsity and irregularity of point clouds and, more critically, the weak grounding and limited out-of-distribution (OOD) generalization of existing captioners across drastically different environments, including indoor and outdoor 3D scenes. To address this challenge, we propose 3D CoCa v2, a generalizable 3D captioning framework that unifies contrastive vision-language learning with 3D caption generation and further improves robustness via test-time search (TTS) without updating the captioner parameters. 3D CoCa v2 builds on a frozen CLIP-based semantic prior, a spatially-aware 3D scene encoder for geometry, and a multimodal decoder jointly optimized with contrastive and captioning objectives, avoiding external detectors or handcrafted proposals. At inference, TTS produces diverse caption candidates and performs reward-guided selection using a compact scene summary. Experiments show improvements over 3D CoCa of +1.50 CIDEr@0.5IoU on ScanRefer and +1.61 CIDEr@0.5IoU on Nr3D, and +3.8 CIDEr@0.25 in zero-shot OOD evaluation on TOD3Cap. Code will be released at this https URL.
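A schematic version of the test-time search (TTS) step is given below, with `generate_caption`, `reward`, and `summarize` as stand-ins for the paper's components; only the sample-then-select logic reflects the description above, and no captioner parameters are updated.

```python
def test_time_search(scene, generate_caption, reward, summarize, n_candidates=8):
    summary = summarize(scene)                                   # compact scene summary
    candidates = [generate_caption(scene, temperature=0.9) for _ in range(n_candidates)]
    best_score, best_caption = max((reward(c, summary), c) for c in candidates)
    return best_caption                                          # reward-guided selection
```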
https://arxiv.org/abs/2601.06496
In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at this https URL.
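A schematic two-stage check in the spirit of AmbiVer: stage one gathers per-view evidence of candidate referents, stage two defers to a VLM judge when more than one referent survives. The helper names (`detect_referents`, `vlm_judge`) and the evidence format are assumptions.

```python
def is_instruction_ambiguous(instruction, views, detect_referents, vlm_judge):
    evidence = []
    for view in views:                                   # stage 1: explicit visual evidence
        evidence.extend(detect_referents(view, instruction))
    if len({e["object_id"] for e in evidence}) == 1:
        return False                                     # exactly one plausible referent
    return vlm_judge(instruction, evidence)              # stage 2: evidence-guided VLM judgment
```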
https://arxiv.org/abs/2601.05991
The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research.
https://arxiv.org/abs/2601.05810
We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited through formative and summative co-play with expert players, then normalized into parametric templates with explicit preconditions and dependency structure. These tasks are paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan, action, and memory events, including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts, and reports outcomes relative to the total number of attempted subtasks using only in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory and tool handling, referencing, and navigation, alongside successful recoveries supported by mixed-initiative clarifications and lightweight memory use. Participants rated interaction quality and interface usability positively, while noting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and evaluation harness to support transparent and reproducible evaluation of future memory-aware embodied agents.
https://arxiv.org/abs/2601.05215
For 3D perception systems to be practical in real-world applications, from autonomous driving to embodied AI, models must adapt to continuously evolving object definitions and sensor domains. Yet, research on continual and transfer learning in 3D point cloud perception remains underexplored compared to 2D vision, particularly under simultaneous domain and label shifts. To address this gap, we propose the RObust Autonomous driving under Dataset shifts (ROAD) benchmark, a comprehensive evaluation suite for LiDAR-based object classification that explicitly accounts for domain shifts as well as three key forms of label evolution: class split, class expansion, and class insertion. Using large-scale datasets (Waymo, NuScenes, Argoverse2), we evaluate zero-shot transfer, linear probing, and continual learning (CL), and analyze the impact of backbone architectures, training objectives, and CL methods. Our findings reveal limitations of existing approaches under realistic shifts and establish strong baselines for future research in robust 3D perception.
https://arxiv.org/abs/2601.07855
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
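The coarse-to-fine loop can be sketched as below, with `select_views`, `vlm_reason`, and `render_view` standing in for the View Selection agent, the reasoning VLM, and the 3D scene renderer; only the step structure is taken from the abstract.

```python
def chain_of_view(question, frames, select_views, vlm_reason, render_view, max_steps=8):
    views = select_views(question, frames)                # coarse: question-aligned anchor views
    for _ in range(max_steps):                            # fine: iterative view adjustment
        step = vlm_reason(question, views)                # returns an answer or a camera action
        if step["type"] == "answer":
            return step["answer"]
        views.append(render_view(step["camera_action"]))  # new observation from the 3D scene
    return vlm_reason(question, views, force_answer=True)["answer"]
```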
https://arxiv.org/abs/2601.05172
As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still have two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow-wo-val). Building on 609 robot manipulation samples, Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to approximately 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.
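The reported >0.93 agreement is a Pearson correlation between per-model overall scores and human preference; the numbers below are made up and only show how such a figure is computed, not the paper's data.

```python
import numpy as np

benchmark_scores = np.array([68.0, 55.2, 41.7, 17.3])   # hypothetical overall scores
human_preference = np.array([4.5, 3.8, 3.1, 1.6])        # hypothetical human ratings
r = np.corrcoef(benchmark_scores, human_preference)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```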
https://arxiv.org/abs/2601.04137
Engineering education faces a double disruption: traditional apprenticeship models that cultivated judgment and tacit skill are eroding, just as generative AI emerges as an informal coaching partner. This convergence rekindles long-standing questions in the philosophy of AI and cognition about the limits of computation, the nature of embodied rationality, and the distinction between information processing and wisdom. Building on this rich intellectual tradition, this paper examines whether AI chatbots can provide coaching that fosters mastery rather than merely delivering information. We synthesize critical perspectives from decades of scholarship on expertise, tacit knowledge, and human-machine interaction, situating them within the context of contemporary AI-driven education. Empirically, we report findings from a mixed-methods study (N = 75 students, N = 7 faculty) exploring the use of a coaching chatbot in engineering education. Results reveal a consistent boundary: participants accept AI for technical problem solving (convergent tasks; M = 3.84 on a 1-5 Likert scale) but remain skeptical of its capacity for moral, emotional, and contextual judgment (divergent tasks). Faculty express stronger concerns over risk (M = 4.71 vs. M = 4.14, p = 0.003), and privacy emerges as a key requirement, with 64-71 percent of participants demanding strict confidentiality. Our findings suggest that while generative AI can democratize access to cognitive and procedural support, it cannot replicate the embodied, value-laden dimensions of human mentorship. We propose a multiplex coaching framework that integrates human wisdom within expert-in-the-loop models, preserving the depth of apprenticeship while leveraging AI scalability to enrich the next generation of engineering education.
https://arxiv.org/abs/2601.03693
Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input, comprising over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation of state-of-the-art (SOTA) LLMs reveals that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at this https URL.
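One plausible way to render a scene as a coordinate-aware textual description for language-only spatial reasoning is sketched below; SiT-Bench's actual serialization format is not specified here, so the field names and units are illustrative.

```python
def scene_to_text(objects):
    lines = []
    for obj in objects:
        x, y, z = obj["position"]
        lines.append(f'{obj["name"]} at (x={x:.1f} m, y={y:.1f} m, z={z:.1f} m), '
                     f'facing {obj["orientation_deg"]:.0f} deg')
    return "\n".join(lines)

print(scene_to_text([
    {"name": "table", "position": (1.0, 0.0, 2.5), "orientation_deg": 90},
    {"name": "robot gripper", "position": (0.2, 0.9, 2.1), "orientation_deg": 180},
]))
```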
https://arxiv.org/abs/2601.03590
We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1) REFLECT for structured fault localization over skill compositions, (2) progressive optimization with maturity-aware update gating that stabilizes reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation that maintains network compactness. We further show that PSN's learning dynamics exhibit structural parallels to neural network training. Experiments on MineDojo and Crafter demonstrate robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions. (We plan to open-source the code.)
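A minimal sketch of maturity-aware update gating, assuming maturity is tracked as a capped success count; PSN's actual mechanism is LLM-driven, so this only conveys the intuition that mature skills are stabilized while uncertain ones stay plastic.

```python
def should_update(skill, failure_observed, maturity_cap=20):
    """Mature skills require stronger evidence of failure before being rewritten."""
    maturity = min(skill["successes"], maturity_cap) / maturity_cap   # 0 = new, 1 = mature
    if not failure_observed:
        return False
    return skill["recent_failures"] > 3 * maturity
```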
https://arxiv.org/abs/2601.03509
We propose a maturity-based framework for certifying embodied AI systems through explicit measurement mechanisms. We argue that certifiable embodied AI requires structured assessment frameworks, quantitative scoring mechanisms, and methods for navigating multi-objective trade-offs inherent in trustworthiness evaluation. We demonstrate this approach using uncertainty quantification as an exemplar measurement mechanism and illustrate feasibility through an Uncrewed Aircraft System (UAS) detection case study.
https://arxiv.org/abs/2601.03470
Wildfire monitoring demands autonomous systems capable of reasoning under extreme visual degradation, rapidly evolving physical dynamics, and scarce real-world training data. Existing UAV navigation approaches rely on simplified simulators and supervised perception pipelines, and lack embodied agents interacting with physically realistic fire environments. We introduce FIRE-VLM, the first end-to-end vision-language model (VLM) guided reinforcement learning (RL) framework trained entirely within a high-fidelity, physics-grounded wildfire digital twin. Built from USGS Digital Elevation Model (DEM) terrain, LANDFIRE fuel inventories, and semi-physical fire-spread solvers, this twin captures terrain-induced runs, wind-driven acceleration, smoke plume occlusion, and dynamic fuel consumption. Within this environment, a PPO agent with dual-view UAV sensing is guided by a CLIP-style VLM. Wildfire-specific semantic alignment scores, derived from a single prompt describing active fire and smoke plumes, are integrated as potential-based reward shaping signals. Our contributions are: (1) a GIS-to-simulation pipeline for constructing wildfire digital twins; (2) a VLM-guided RL agent for UAV firefront tracking; and (3) a wildfire-aware reward design that combines physical terms with VLM semantics. Across five digital-twin evaluation tasks, our VLM-guided policy reduces time-to-detection by up to 6 times, increases time-in-FOV, and is, to our knowledge, the first RL-based UAV wildfire monitoring system demonstrated in kilometer-scale, physics-grounded digital-twin fires.
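The shaping term described above follows the standard potential-based form, with the CLIP-style alignment score acting as the potential; `clip_similarity` and the prompt text are placeholders rather than the paper's implementation.

```python
def shaped_reward(r, obs, next_obs, clip_similarity, gamma=0.99,
                  prompt="an active fire front with a smoke plume"):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s)."""
    phi = clip_similarity(obs, prompt)
    phi_next = clip_similarity(next_obs, prompt)
    return r + gamma * phi_next - phi
```

Using the potential-based form is what keeps the semantic guidance from changing the underlying task optimum.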
https://arxiv.org/abs/2601.03449
Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
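Two of the simpler lexical measures, a type-token ratio and an exact-duplicate rate, can be computed as below over a list of instruction strings; the paper's metric definitions may be more involved.

```python
from collections import Counter

def lexical_stats(instructions):
    tokens = [tok for ins in instructions for tok in ins.lower().split()]
    type_token_ratio = len(set(tokens)) / max(len(tokens), 1)
    unique = Counter(ins.strip().lower() for ins in instructions)
    duplicate_rate = 1 - len(unique) / max(len(instructions), 1)
    return {"type_token_ratio": round(type_token_ratio, 3),
            "duplicate_rate": round(duplicate_rate, 3)}

print(lexical_stats(["pick up the red block", "pick up the red block", "open the drawer"]))
# {'type_token_ratio': 0.538, 'duplicate_rate': 0.333}
```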
https://arxiv.org/abs/2601.03136
Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in the VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action planning.
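The "small set of new learnable parameters" idea can be pictured as a frozen VLM backbone with a lightweight trainable action head, as in the PyTorch sketch below; the dimensions, module names, and pooled-feature interface are assumptions rather than VLM4VLA's actual code.

```python
import torch.nn as nn

class FrozenVLMPolicy(nn.Module):
    def __init__(self, vlm_backbone, feat_dim=1024, action_dim=7):
        super().__init__()
        self.backbone = vlm_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                     # keep the pretrained VLM frozen
        self.action_head = nn.Sequential(               # the only new learnable parameters
            nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, action_dim))

    def forward(self, image, instruction):
        feats = self.backbone(image, instruction)       # assumed to return pooled features
        return self.action_head(feats)
```

In this reading, only the action head receives gradients during downstream fine-tuning, which is what makes backbone-for-backbone comparisons cheap and fair.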
https://arxiv.org/abs/2601.03309
LLM-based agents are increasingly deployed to autonomously solve complex tasks, raising urgent needs for IP protection and regulatory provenance. While content watermarking effectively attributes LLM-generated outputs, it fails to directly identify the high-level planning behaviors (e.g., tool and subgoal choices) that govern multi-step execution. Critically, watermarking at the planning-behavior layer faces unique challenges: minor distributional deviations in decision-making can compound during long-term agent operation, degrading utility, and many agents operate as black boxes that are difficult to intervene in directly. To bridge this gap, we propose AgentMark, a behavioral watermarking framework that embeds multi-bit identifiers into planning decisions while preserving utility. It operates by eliciting an explicit behavior distribution from the agent and applying distribution-preserving conditional sampling, enabling deployment under black-box APIs while remaining compatible with action-layer content watermarking. Experiments across embodied, tool-use, and social environments demonstrate practical multi-bit capacity, robust recovery from partial logs, and utility preservation. The code is available at this https URL.
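One way to picture distribution-preserving conditional sampling is inverse-transform sampling driven by keyed pseudo-randomness, where the watermark bit selects the key: each individual decision still follows the agent's declared behavior distribution, and a detector that re-derives the pseudo-random draws can test which key better explains the observed choices. This is only an illustrative scheme, not AgentMark's exact construction.

```python
import hashlib

def keyed_uniform(key: str, context: str) -> float:
    digest = hashlib.sha256(f"{key}|{context}".encode()).hexdigest()
    return int(digest, 16) / 16 ** len(digest)           # deterministic value in [0, 1)

def watermarked_choice(behavior_dist, context, bit, keys=("key-0", "key-1")):
    """behavior_dist maps action name -> probability (assumed to sum to 1)."""
    u = keyed_uniform(keys[bit], context)
    cumulative = 0.0
    for action, prob in sorted(behavior_dist.items()):    # inverse-transform sampling
        cumulative += prob
        if u < cumulative:
            return action
    return action                                          # guard against rounding

print(watermarked_choice({"search_web": 0.5, "call_api": 0.3, "ask_user": 0.2},
                         context="step-3|subgoal=book_flight", bit=1))
```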
https://arxiv.org/abs/2601.03294