Paper Reading AI Learner

Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

2025-10-09 17:58:01
Yunzhe Xu, Yiyuan Pan, Zhe Liu

Abstract

Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, relying instead on incorporating the entire memory or on fixed-horizon lookup, and they predominantly store only environmental observations while neglecting the navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states and uses them as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) a Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates the retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks spanning 10 distinct test scenarios demonstrates Memoir's effectiveness: significant improvements in all scenarios, including a 5.4% SPL gain on IR2R over the best memory-persistent baseline, accompanied by an 8.3x training speedup and a 74% reduction in inference memory. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs. a 93.4% upper bound) for this imagination-guided paradigm. Code at this https URL.
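
To make the retrieval mechanism concrete, the following is a minimal Python sketch of the idea the abstract describes: imagined future states produced by a world model serve double duty as keys for storing experiences and as queries for retrieving them, against a memory that anchors both observation and behavior embeddings to viewpoints. Every name, shape, the stand-in world model, and the cosine-similarity top-k retrieval here are illustrative assumptions, not Memoir's actual implementation (see the linked code for that).

import numpy as np

# Illustrative sketch only; names, shapes, and the similarity metric are
# assumptions, not Memoir's real API.

class ViewpointMemory:
    """Anchors observation and behavior embeddings to stored keys,
    standing in for the paper's Hybrid Viewpoint-Level Memory."""

    def __init__(self):
        self.keys = []          # imagined-state embeddings used as memory keys
        self.observations = []  # environmental observation embeddings
        self.behaviors = []     # behavioral-pattern (action-history) embeddings

    def write(self, key, obs, behavior):
        self.keys.append(key)
        self.observations.append(obs)
        self.behaviors.append(behavior)

    def retrieve(self, query, k=5):
        """Return the top-k (observation, behavior) pairs whose keys are
        most cosine-similar to the query."""
        if not self.keys:
            return [], []
        keys = np.stack(self.keys)                      # (N, dim)
        sims = keys @ query / (
            np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8
        )
        top = np.argsort(-sims)[:k]
        return ([self.observations[i] for i in top],
                [self.behaviors[i] for i in top])

# A stand-in for the language-conditioned world model: any function that
# maps (instruction embedding, current state) to an imagined future state.
def imagine_future_state(world_model, instruction_emb, current_state):
    return world_model(instruction_emb, current_state)

# Usage: populate memory from past episodes, then let an imagined future
# state query it; the retrieved experiences would feed the navigation policy.
dim = 64
rng = np.random.default_rng(0)
world_model = lambda instr, state: np.tanh(instr + state)  # toy dynamics

memory = ViewpointMemory()
for _ in range(20):
    memory.write(rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim))

query = imagine_future_state(world_model, rng.normal(size=dim), rng.normal(size=dim))
obs_k, beh_k = memory.retrieve(query, k=5)
print(len(obs_k), len(beh_k))  # -> 5 5

In the paper's pipeline the retrieved observations and behavioral histories are fused into the navigation model through specialized encoders; the sketch stops at retrieval, which is the part the imagination-as-query idea actually changes.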

URL

https://arxiv.org/abs/2510.08553

PDF

https://arxiv.org/pdf/2510.08553.pdf

