Paper Reading AI Learner

AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans

2024-11-27 17:36:08
Dillon Loh, Tomasz Bednarz, Xinxing Xia, Frank Guan

Abstract

Visual Language Navigation is a task that challenges robots to navigate realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity to navigation tasks that mimics the real world. To support exploration of this task, we also present the AdaVLN simulator and the AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a "freeze-time" mechanism for both the navigation task and simulator, which pauses world-state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
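The "freeze-time" idea described above can be illustrated with a minimal sketch: world state advances only by simulated time steps, never by the wall-clock time an agent spends on inference, so slow and fast hardware see identical world trajectories. The class and method names below are illustrative assumptions, not the paper's actual API.

```python
class FreezeTimeSim:
    """Toy loop illustrating a 'freeze-time' mechanism: a moving human
    obstacle advances only by simulated dt, so the agent's wall-clock
    inference latency never changes what the agent observes."""

    def __init__(self, human_speed=0.5):
        self.sim_time = 0.0      # simulated seconds elapsed
        self.human_pos = 0.0     # 1-D stand-in for a human obstacle's position
        self.human_speed = human_speed

    def step_world(self, dt):
        # World-state updates are driven solely by simulated dt,
        # independent of real time.
        self.sim_time += dt
        self.human_pos += self.human_speed * dt

    def run_episode(self, agent, n_steps, dt=0.1):
        for _ in range(n_steps):
            obs = self.human_pos
            # The world is "frozen" here: however long agent(obs) takes
            # in wall-clock time, sim_time and human_pos do not change.
            action = agent(obs)
            self.step_world(dt)  # advance only after inference returns
        return self.sim_time, self.human_pos
```

Because the world only moves inside `step_world`, a deliberately slow agent and a fast one produce identical final states, which is what makes cross-hardware comparisons fair and reproducible.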

URL

https://arxiv.org/abs/2411.18539

PDF

https://arxiv.org/pdf/2411.18539.pdf

