Paper Reading AI Learner

NOLO: Navigate Only Look Once

2024-08-02 16:41:34
Bohan Zhou, Jiangxing Wang, Zongqing Lu

Abstract

The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on the video navigation setting, where an in-context navigation policy needs to be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses the in-context ability and adapts to new scenes by taking corresponding context videos as input without finetuning or re-training. To enable learning from videos, we first propose a pseudo action labeling procedure using optical flow to recover the action label from egocentric videos. Then, offline reinforcement learning is applied to learn the navigation policy. Through extensive experiments on different scenes, we show that our algorithm outperforms baselines by a large margin, which demonstrates the in-context learning ability of the learned policy.

Abstract (translated)

Transformer模型的上下文学习能力为视觉导航带来了新的可能性。在本文中,我们关注视频导航设置,其中需要从离线视频中学习纯粹的上下文导航策略,而无需访问实际环境。为此设置,我们提出了仅查看一次(NOLO)方法,一种通过光学流学习导航策略的方法,具有上下文能力并能够适应新场景,而无需进行微调或重新训练。为了从视频中学习,我们首先采用光学流伪动作标注方法通过从自旋视频中恢复动作标签。然后,应用于离线强化学习来学习导航策略。通过在各种场景进行广泛的实验,我们证明了我们的算法在基线之间具有很大的优势,这证明了所学习策略的上下文学习能力。

URL

https://arxiv.org/abs/2408.01384

PDF

https://arxiv.org/pdf/2408.01384.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot