Abstract
Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, relying instead on incorporating the entire memory or on fixed-horizon lookup, and they predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises three components: 1) a language-conditioned world model that imagines future states serving dual purposes, encoding experiences for storage and generating retrieval queries; 2) a Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinct test scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by an 8.3x training speedup and a 74% reduction in inference memory. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs. a 93.4% upper bound) for this imagination-guided paradigm. Code at this https URL.
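The core idea of imagination-as-retrieval can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: all names (`ViewpointMemory`, `imagine_future_state`) and the linear "world model" are assumptions made for clarity. It shows the flow the abstract describes, where an imagined future state serves as a query against a viewpoint-anchored memory holding both observations and behavioral records.

```python
import numpy as np

rng = np.random.default_rng(0)

class ViewpointMemory:
    """Toy hybrid memory: keys, observations, and behaviors anchored per viewpoint."""
    def __init__(self):
        self.keys = []          # state embeddings used as retrieval keys
        self.observations = []  # environmental observation features
        self.behaviors = []     # behavioral records, e.g. actions taken

    def write(self, key, observation, behavior):
        # Normalize keys so retrieval reduces to cosine similarity.
        self.keys.append(key / np.linalg.norm(key))
        self.observations.append(observation)
        self.behaviors.append(behavior)

    def retrieve(self, query, k=2):
        """Return the top-k (observation, behavior) pairs most similar to the query."""
        q = query / np.linalg.norm(query)
        sims = np.stack(self.keys) @ q
        top = np.argsort(-sims)[:k]
        return [(self.observations[i], self.behaviors[i]) for i in top]

def imagine_future_state(current_state, instruction_emb):
    """Stand-in for the language-conditioned world model: here just a
    fixed blend of state and instruction embeddings (pure illustration)."""
    return 0.5 * current_state + 0.5 * instruction_emb

# Populate the memory with a few fake viewpoint entries.
memory = ViewpointMemory()
for i in range(5):
    memory.write(rng.normal(size=8), observation=f"obs_{i}", behavior=f"action_{i}")

# Imagined future state acts as the retrieval query.
query = imagine_future_state(rng.normal(size=8), rng.normal(size=8))
retrieved = memory.retrieve(query, k=2)
print(retrieved)
```

The point of the sketch is the query path: instead of attending over the whole memory or a fixed window, the agent retrieves only entries whose keys resemble where it predicts it will be.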
URL
https://arxiv.org/abs/2510.08553