DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments

2024-02-29 10:03:57
Ji Ma, Hongming Dai, Yao Mu, Pengying Wu, Hao Wang, Xiaowei Chi, Yang Fei, Shanghang Zhang, Chang Liu

Abstract

Zero-Shot Object Navigation (ZSON) requires agents to autonomously locate and approach unseen objects in unfamiliar environments, and it has emerged as a particularly challenging task in Embodied AI. Existing datasets for developing ZSON algorithms do not account for dynamic obstacles, object attribute diversity, or scene text, and so diverge noticeably from real-world conditions. To address these issues, we propose a Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (DOZE), which comprises ten high-fidelity 3D scenes with over 18k tasks and aims to mimic complex, dynamic real-world scenarios. Specifically, DOZE scenes feature multiple moving humanoid obstacles, a wide array of open-vocabulary objects, diverse distinct-attribute objects, and valuable textual hints. In addition, unlike existing datasets that only provide collision checking between the agent and static obstacles, DOZE can detect collisions between the agent and moving obstacles. This functionality enables evaluation of agents' collision avoidance abilities in dynamic environments. We test four representative ZSON methods on DOZE, revealing substantial room for improvement in existing approaches in terms of navigation efficiency, safety, and object recognition accuracy. Our dataset can be found at this https URL.
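
To make the idea of collision checking against moving obstacles concrete, the following is a minimal illustrative sketch and not DOZE's actual implementation or API. It assumes that the agent and each humanoid obstacle can be approximated as discs on the floor plane and that their positions are sampled at the same time steps. The names `Disc`, `collides`, and `count_dynamic_collisions`, and the default radii, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Disc:
    """A body approximated as a disc on the floor plane (hypothetical model)."""
    x: float
    y: float
    radius: float


def collides(a: Disc, b: Disc) -> bool:
    """Return True when the two discs overlap or touch."""
    return math.hypot(a.x - b.x, a.y - b.y) <= a.radius + b.radius


def count_dynamic_collisions(agent_path, obstacle_paths,
                             agent_radius=0.2, obstacle_radius=0.3):
    """Count time steps at which the agent overlaps any moving obstacle.

    agent_path: list of (x, y) positions, one per time step.
    obstacle_paths: list of per-obstacle position lists, each with the
        same length as agent_path (assumed radii are illustrative only).
    """
    hits = 0
    for t, (ax, ay) in enumerate(agent_path):
        agent = Disc(ax, ay, agent_radius)
        # A step counts once even if several obstacles overlap the agent.
        if any(collides(agent, Disc(*path[t], obstacle_radius))
               for path in obstacle_paths):
            hits += 1
    return hits


if __name__ == "__main__":
    agent_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    walker = [(2.0, 0.0), (1.4, 0.0), (5.0, 5.0)]
    print(count_dynamic_collisions(agent_path, [walker]))
```

A per-step count like this is one simple way a safety metric could be derived from an episode; a real simulator would typically use its physics engine's contact queries rather than a 2D disc approximation.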

URL

https://arxiv.org/abs/2402.19007

PDF

https://arxiv.org/pdf/2402.19007.pdf