
PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation

2023-05-19 02:25:56
Liuyi Wang, Chengju Liu, Zongtao He, Shu Li, Qingqing Yan, Huiyi Chen, Qijun Chen

Abstract

Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task. One powerful technique for improving generalization in VLN is to use an independent speaker model to generate pseudo instructions for data augmentation. However, current speaker models based on Long Short-Term Memory (LSTM) networks lack the ability to attend to the features that matter at different locations and time steps. To address this, we propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses the transformer as the core of the network. PASTS uses a spatio-temporal encoder to fuse panoramic representations and to encode the intermediate connections between steps. In addition, to avoid the misalignment problem that could result in incorrect supervision, a speaker progress monitor (SPM) is proposed that enables the model to estimate the progress of instruction generation and produce more fine-grained captions. A multifeature dropout (MFD) strategy is also introduced to alleviate overfitting. The proposed PASTS can be flexibly combined with existing VLN models. Experimental results demonstrate that PASTS outperforms all existing speaker models and improves the performance of previous VLN models, achieving state-of-the-art performance on the standard Room-to-Room (R2R) dataset.
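
The abstract outlines three components: a spatio-temporal encoder over panoramic features, a speaker progress monitor (SPM) that estimates generation progress, and multifeature dropout (MFD). The following is a minimal PyTorch sketch of how these pieces could fit together. Every module name, layer size, and the exact MFD and SPM formulations below are illustrative assumptions for orientation, not the paper's released implementation.

```python
# Hypothetical sketch of the PASTS pipeline described in the abstract.
# Module names, dimensions, and loss targets are illustrative assumptions.
import torch
import torch.nn as nn

class PASTSSketch(nn.Module):
    def __init__(self, d_model=512, vocab_size=1000, feat_dim=2048, mfd_p=0.3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        # MFD stand-in: plain dropout on the projected visual features; the
        # paper's MFD presumably drops several feature types (assumption).
        self.mfd = nn.Dropout(mfd_p)
        # Spatial encoder: fuses the panoramic view features within each step.
        spatial_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.spatial_enc = nn.TransformerEncoder(spatial_layer, num_layers=2)
        # Temporal encoder: models connections across navigation steps.
        temporal_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal_enc = nn.TransformerEncoder(temporal_layer, num_layers=2)
        # Decoder generates the pseudo instruction token by token.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.word_head = nn.Linear(d_model, vocab_size)
        # SPM stand-in: a head regressing, per decoding position, how far
        # instruction generation has advanced (values in [0, 1]).
        self.progress_head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, pano_feats, tokens):
        # pano_feats: (batch, steps, views, feat_dim); tokens: (batch, length)
        b, t, v, _ = pano_feats.shape
        x = self.mfd(self.proj(pano_feats))
        # Fuse views within each step (spatial attention), then pool.
        x = self.spatial_enc(x.view(b * t, v, -1)).mean(dim=1).view(b, t, -1)
        # Encode dependencies across steps (temporal attention).
        memory = self.temporal_enc(x)
        # Causal decoding of instruction tokens conditioned on the path.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(self.tok_emb(tokens), memory, tgt_mask=mask)
        return self.word_head(h), self.progress_head(h).squeeze(-1)

# Usage: a plausible SPM target is each token's normalized position
# (position / length), trained with an auxiliary regression loss.
model = PASTSSketch()
words, progress = model(torch.randn(2, 5, 36, 2048),
                        torch.randint(0, 1000, (2, 12)))
```

Under these assumptions, the progress head gives the decoder an explicit alignment signal between generated words and the trajectory, which is how an SPM could mitigate the misalignment problem the abstract mentions.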


URL

https://arxiv.org/abs/2305.11918

PDF

https://arxiv.org/pdf/2305.11918.pdf

