Abstract
The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on the video navigation setting, where an in-context navigation policy must be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses in-context ability and adapts to new scenes by taking the corresponding context video as input, without finetuning or re-training. To enable learning from videos, we first propose a pseudo-action labeling procedure that uses optical flow to recover action labels from egocentric videos. Offline reinforcement learning is then applied to learn the navigation policy. Through extensive experiments on different scenes, we show that our algorithm outperforms baselines by a large margin, demonstrating the in-context learning ability of the learned policy.
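The pseudo-action labeling step can be sketched roughly as follows. The discrete action set (`turn_left`, `turn_right`, `move_forward`), the threshold rule, and the function names are assumptions for illustration only; the abstract states merely that optical flow is used to recover action labels from egocentric videos:

```python
# Hypothetical sketch of optical-flow pseudo-action labeling.
# The action set and decision rule below are illustrative assumptions,
# not NOLO's actual procedure.

ACTIONS = ("turn_left", "turn_right", "move_forward")

def pseudo_action(flow, turn_threshold=1.0):
    """Label the transition between two consecutive frames.

    flow: list of per-pixel (dx, dy) displacement vectors from an
    optical-flow estimator. When the camera yaws right, the scene
    appears to flow left (negative dx), and vice versa; a weak net
    horizontal flow is read as forward motion.
    """
    mean_dx = sum(dx for dx, _ in flow) / len(flow)
    if mean_dx > turn_threshold:
        return "turn_left"    # scene moved right -> camera turned left
    if mean_dx < -turn_threshold:
        return "turn_right"   # scene moved left -> camera turned right
    return "move_forward"
```

Labeling every consecutive frame pair this way turns a raw video into (observation, action, next observation) tuples, the kind of dataset an offline RL algorithm can then consume.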
URL
https://arxiv.org/abs/2408.01384