Masked Path Modeling for Vision-and-Language Navigation

Abstract

Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions. A major challenge in VLN is the limited availability of training data, which hinders the models' ability to generalize effectively. Previous approaches have attempted to address this issue by introducing additional supervision during training, often requiring costly human-annotated data that restricts scalability. In this paper, we introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks. Our proposed method involves allowing the agent to actively explore navigation environments without a specific goal and collect the paths it traverses. Subsequently, we train the agent on this collected data to reconstruct the original path given a randomly masked subpath. This way, the agent can actively accumulate a diverse and substantial amount of data while learning conditional action generation. To evaluate the effectiveness of our technique, we conduct experiments on various VLN datasets and demonstrate the versatility of MPM across different levels of instruction complexity. Our results exhibit significant improvements in success rates, with enhancements of 1.32%, 1.05%, and 1.19% on the val-unseen split of the Room-to-Room, Room-for-Room, and Room-across-Room datasets, respectively. Furthermore, we conduct an analysis that highlights the potential for additional improvements when the agent is allowed to explore unseen environments prior to testing.
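
The data-collection and masking steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the exploration policy or the masking scheme, so the sketch assumes a uniform random walk over a toy navigation graph and masks a random subset of visited viewpoints. The names `explore`, `mask_path`, `make_example`, `TOY_GRAPH`, and the `[MASK]` token are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a hidden viewpoint

# Toy navigation graph (assumption): viewpoint -> adjacent viewpoints.
TOY_GRAPH = {
    "hall": ["kitchen", "bedroom"],
    "kitchen": ["hall", "pantry"],
    "bedroom": ["hall"],
    "pantry": ["kitchen"],
}


def explore(graph, start, num_steps, rng):
    """Goal-free exploration: a uniform random walk that records its path."""
    path = [start]
    for _ in range(num_steps):
        neighbors = graph[path[-1]]
        if not neighbors:  # dead end: stop exploring
            break
        path.append(rng.choice(neighbors))
    return path


def mask_path(path, mask_ratio, rng):
    """Hide a random subset of viewpoints.

    Returns (masked_path, target_path). A model would be conditioned on
    masked_path and trained to regenerate target_path.
    """
    n = len(path)
    # Mask at least one viewpoint and, when possible, keep at least one.
    num_masked = max(1, min(n - 1, round(n * mask_ratio)))
    masked_idx = set(rng.sample(range(n), num_masked))
    masked = [MASK if i in masked_idx else node for i, node in enumerate(path)]
    return masked, list(path)


def make_example(graph, start, num_steps, mask_ratio, seed):
    """Collect one self-supervised training pair from a single random walk."""
    rng = random.Random(seed)
    path = explore(graph, start, num_steps, rng)
    return mask_path(path, mask_ratio, rng)


if __name__ == "__main__":
    masked, target = make_example(TOY_GRAPH, "hall", num_steps=6, mask_ratio=0.5, seed=0)
    print("input :", masked)
    print("target:", target)
```

In this sketch the pair `(masked, target)` plays the role of one pretraining example: the visible viewpoints stand in for the conditioning signal, and the full path is the reconstruction target. No human annotation is needed, which is the scalability argument the abstract makes.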

Abstract (translated)

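Vision-and-language navigation (VLN) agents are trained to navigate real-world environments by following natural language instructions. A major challenge in VLN is that training data is limited, which restricts the models' ability to generalize. Earlier methods tried to address this by introducing additional supervision during training, which usually requires expensive human-annotated data and limits scalability. In this paper, we introduce a masked path modeling (MPM) objective that pretrains an agent on self-collected data for downstream navigation tasks. Our method lets the agent actively explore navigation environments without a specific goal and collect the paths it traverses, then trains the agent to recover the original path from a randomly masked subpath. In this way, the agent can actively accumulate a large and diverse body of data while learning conditional action generation. To evaluate the effectiveness of our technique, we run experiments on several VLN datasets and show that MPM applies across different levels of instruction complexity. Our results show significant improvements in success rate, with gains of 1.32%, 1.05%, and 1.19% on the val-unseen split of the Room-to-Room, Room-for-Room, and Room-across-Room datasets, respectively. In addition, we present an analysis that highlights the potential for further improvement when the agent explores unseen environments before testing.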

URL

https://arxiv.org/abs/2305.14268

PDF

https://arxiv.org/pdf/2305.14268.pdf