Masked Path Modeling for Vision-and-Language Navigation

2023-05-23 17:20:20
Zi-Yi Dou, Feng Gao, Nanyun Peng

Abstract

Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions. A major challenge in VLN is the limited availability of training data, which hinders the models' ability to generalize effectively. Previous approaches have attempted to address this issue by introducing additional supervision during training, often requiring costly human-annotated data that restricts scalability. In this paper, we introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks. Our proposed method involves allowing the agent to actively explore navigation environments without a specific goal and collect the paths it traverses. Subsequently, we train the agent on this collected data to reconstruct the original path given a randomly masked subpath. This way, the agent can actively accumulate a diverse and substantial amount of data while learning conditional action generation. To evaluate the effectiveness of our technique, we conduct experiments on various VLN datasets and demonstrate the versatility of MPM across different levels of instruction complexity. Our results exhibit significant improvements in success rates, with enhancements of 1.32%, 1.05%, and 1.19% on the val-unseen split of the Room-to-Room, Room-for-Room, and Room-across-Room datasets, respectively. Furthermore, we conduct an analysis that highlights the potential for additional improvements when the agent is allowed to explore unseen environments prior to testing.
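
The data-collection step described in the abstract (goal-free exploration, then masking part of each collected path) can be illustrated with a small sketch. This is not the authors' implementation: the toy graph, the `<mask>` token, the function names, and the choice to mask one contiguous span are all assumptions made for illustration, and the model that would learn to reconstruct the path is omitted entirely.

```python
import random

# Hypothetical toy navigation graph (not from the paper):
# each node maps to the nodes reachable in one step.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

MASK = "<mask>"  # assumed placeholder token for hidden path positions


def explore(graph, start, num_steps, rng):
    """Goal-free random walk: return the list of visited nodes."""
    path = [start]
    for _ in range(num_steps):
        path.append(rng.choice(graph[path[-1]]))
    return path


def mask_subpath(path, mask_len, rng):
    """Hide one random contiguous span of the path.

    Returns (masked_path, original_path); a model would be trained
    to recover original_path from masked_path.
    """
    if not 0 < mask_len <= len(path):
        raise ValueError("mask_len must be between 1 and len(path)")
    start = rng.randrange(len(path) - mask_len + 1)
    masked = list(path)
    masked[start:start + mask_len] = [MASK] * mask_len
    return masked, list(path)


def make_examples(graph, num_paths, num_steps, mask_len, seed=0):
    """Collect paths by exploration and turn each into a training pair."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    examples = []
    for _ in range(num_paths):
        path = explore(graph, rng.choice(nodes), num_steps, rng)
        examples.append(mask_subpath(path, mask_len, rng))
    return examples


if __name__ == "__main__":
    for masked, target in make_examples(TOY_GRAPH, num_paths=3, num_steps=6, mask_len=2):
        print(masked, "->", target)
```

Because the pairs are produced by the agent's own exploration, no human annotation is needed, which is the scalability argument the abstract makes. In the actual method the inputs would be visual observations rather than node labels, and the target would be a sequence of actions.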

URL

https://arxiv.org/abs/2305.14268

PDF

https://arxiv.org/pdf/2305.14268.pdf

