
VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

2026-01-13 15:43:43
Shaoan Wang, Yuanfei Luo, Xingyu Chen, Aocheng Luo, Dongyue Li, Chang Liu, Sheng Chen, Yangang Zhang, Junzhi Yu

Abstract

VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.

URL

https://arxiv.org/abs/2601.08665

PDF

https://arxiv.org/pdf/2601.08665.pdf

