Abstract
Visual Language Navigation is a task that challenges robots to navigate realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity that brings the navigation task closer to the real world. To support exploration of this task, we also present the AdaVLN simulator and the AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a "freeze-time" mechanism for both the navigation task and the simulator, which pauses world-state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
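The "freeze-time" mechanism can be illustrated with a small sketch. This is a hypothetical, minimal stand-in (the class and function names below are illustrative, not the paper's actual API): the simulated clock advances only by a fixed step between inference calls, so an agent's wall-clock inference latency never changes the world it observes.

```python
import time

class FreezeTimeSimulator:
    """Minimal stand-in for a simulator whose world state advances by dt.

    Hypothetical sketch: dynamic obstacles (e.g. walking humans) would be
    updated inside step(); here we only track the simulated clock.
    """
    def __init__(self):
        self.sim_time = 0.0  # simulated clock, independent of wall clock

    def step(self, dt):
        # Advance the world (and any animated humans) by dt simulated seconds.
        self.sim_time += dt

def run_episode(sim, agent, n_steps=10, dt=0.1):
    """Run a fixed-length episode; the world is frozen during agent inference."""
    for _ in range(n_steps):
        obs = sim.sim_time           # stand-in for a rendered observation
        action = agent(obs)          # world state does NOT advance while this runs
        sim.step(dt)                 # only the fixed dt advances the world
    return sim.sim_time
```

Because only `step(dt)` moves the clock, a slow agent and a fast agent see identical world trajectories, which is what makes results reproducible across hardware of different speeds.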
URL
https://arxiv.org/abs/2411.18539