TDANet: Target-Directed Attention Network For Object-Goal Visual Navigation With Zero-Shot Ability

2024-04-12 09:44:18
Shiwei Lian, Feitian Zhang

Abstract

The generalization of end-to-end deep reinforcement learning (DRL) for object-goal visual navigation is a long-standing challenge, since object classes and placements vary across test environments. Learning domain-independent visual representations is critical for enabling the trained DRL agent to generalize to unseen scenes and objects. In this letter, a target-directed attention network (TDANet) is proposed to learn an end-to-end object-goal visual navigation policy with zero-shot ability. TDANet features a novel target attention (TA) module that learns both the spatial and semantic relationships among objects, helping TDANet focus on the observed objects most relevant to the target. With its Siamese architecture (SA) design, TDANet learns the difference between the current and target states and generates a domain-independent visual representation. To evaluate the navigation performance of TDANet, extensive experiments are conducted in the AI2-THOR embodied AI environment. The simulation results demonstrate that TDANet generalizes strongly to unseen scenes and target objects, achieving a higher navigation success rate (SR) and success weighted by path length (SPL) than other state-of-the-art models.
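The abstract describes two architectural ideas: attention over observed objects conditioned on the target, and a Siamese (shared-weight) comparison of the current and target states. Below is a minimal PyTorch sketch of how such a pairing could look. It is not the authors' implementation; every module name, dimension, and input format here is an assumption for illustration only.

```python
# A minimal sketch (not the authors' code) of the two ideas the abstract
# describes: a target attention (TA) module that weights detected objects
# by relevance to the goal, and a Siamese comparison of the current and
# target state embeddings with shared encoder weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TargetAttention(nn.Module):
    """Attend over per-object features (e.g., detector outputs combining
    bounding-box position with a class embedding) using the target class
    embedding as the query."""

    def __init__(self, obj_dim: int, target_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.query = nn.Linear(target_dim, hidden_dim)
        self.key = nn.Linear(obj_dim, hidden_dim)
        self.value = nn.Linear(obj_dim, hidden_dim)

    def forward(self, obj_feats: torch.Tensor, target_emb: torch.Tensor):
        # obj_feats: (N, obj_dim) features of N observed objects
        # target_emb: (target_dim,) embedding of the goal class
        q = self.query(target_emb)                            # (hidden,)
        k = self.key(obj_feats)                               # (N, hidden)
        v = self.value(obj_feats)                             # (N, hidden)
        # Scaled dot-product attention: one weight per observed object.
        attn = F.softmax(k @ q / k.shape[-1] ** 0.5, dim=0)   # (N,)
        return attn @ v                                       # (hidden,)


class SiameseComparator(nn.Module):
    """Encode current and target states with the same weights and return
    their difference as a domain-independent representation."""

    def __init__(self, in_dim: int, out_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, current: torch.Tensor, target: torch.Tensor):
        return self.encoder(current) - self.encoder(target)


# Example usage with 5 hypothetical detected objects:
ta = TargetAttention(obj_dim=32, target_dim=16)
pooled = ta(torch.randn(5, 32), torch.randn(16))
```

For reference, the two reported metrics are standard in this literature: SR is the fraction of episodes in which the agent reaches the target, and SPL (Anderson et al., 2018) is SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the binary success indicator, l_i the shortest-path length, and p_i the length of the path the agent actually took.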

URL

https://arxiv.org/abs/2404.08353

PDF

https://arxiv.org/pdf/2404.08353.pdf

