Paper Reading AI Learner

AIGeN: An Adversarial Approach for Instruction Generation in VLN

2024-04-15 18:00:30
Niyati Rawal, Roberto Bigazzi, Lorenzo Baraldi, Rita Cucchiara

Abstract

In the last few years, research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task in which an agent follows human instructions and navigates a previously unknown environment to reach a specified goal. Recent work in the literature focuses on different ways of augmenting the available datasets of instructions, exploiting synthetic training data to improve navigation performance. In this work, we propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance. The model is composed of a Transformer decoder (GPT-2) and a Transformer encoder (BERT). During training, the decoder generates sentences for a sequence of images describing the agent's path to a particular point, while the encoder discriminates between real and fake instructions. Experimentally, we evaluate the quality of the generated instructions and perform extensive ablation studies. Additionally, we use AIGeN to generate synthetic instructions for 217K trajectories on the Habitat-Matterport 3D Dataset (HM3D) and show an improvement in the performance of an off-the-shelf VLN method. The validation analysis, conducted on REVERIE and R2R, highlights the promising aspects of our proposal, which achieves state-of-the-art performance.
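The adversarial setup described in the abstract (a decoder acting as generator, an encoder acting as discriminator over real vs. generated instructions) can be sketched schematically. Everything below is an illustrative assumption, not the paper's implementation: the toy vocabulary, the feature-seeded random generator standing in for GPT-2, and the bag-of-words logistic discriminator standing in for BERT are all hypothetical simplifications chosen to keep the example self-contained.

```python
import math
import random

# Hypothetical tiny vocabulary; the real model uses GPT-2/BERT tokenizers.
VOCAB = ["walk", "turn", "left", "right", "stop", "stairs", "door", "<eos>"]

class InstructionGenerator:
    """Toy stand-in for the GPT-2 decoder: maps path image features to an
    instruction token sequence (random here, learned in the actual model)."""
    def generate(self, image_feats, max_len=8):
        # Seed from the features so the same path yields the same sentence.
        rng = random.Random(int(sum(image_feats) * 1000))
        tokens = []
        for _ in range(max_len):
            tok = rng.choice(VOCAB)
            tokens.append(tok)
            if tok == "<eos>":
                break
        return tokens

class InstructionDiscriminator:
    """Toy stand-in for the BERT encoder: logistic regression over a
    bag-of-words, outputting P(instruction is human-written)."""
    def __init__(self):
        self.weights = {w: 0.0 for w in VOCAB}
        self.bias = 0.0

    def score(self, tokens):
        z = self.bias + sum(self.weights.get(t, 0.0) for t in tokens)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, tokens, label, lr=0.1):
        # One gradient step on the logistic log-likelihood.
        err = label - self.score(tokens)
        self.bias += lr * err
        for t in tokens:
            if t in self.weights:
                self.weights[t] += lr * err

gen = InstructionGenerator()
disc = InstructionDiscriminator()
real_instruction = ["walk", "left", "stop", "<eos>"]  # stands in for a human annotation

# GAN-style alternation: the discriminator learns to separate real from
# generated instructions; in the full model the generator would then be
# updated to fool it (omitted here, since this toy generator has no parameters).
for _ in range(50):
    fake_instruction = gen.generate([0.2, 0.7, 0.1])
    disc.update(real_instruction, 1.0)  # real instruction -> label 1
    disc.update(fake_instruction, 0.0)  # generated instruction -> label 0
```

In the paper's setup, the generator's training signal would come from the discriminator's scores, pushing GPT-2 toward instructions that BERT cannot distinguish from human ones; the sketch above shows only the discriminator half of that alternation.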

Abstract (translated)

In the last few years, research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task that requires an agent to follow human instructions and navigate an unknown environment to reach a specified goal. The literature focuses on different ways of augmenting the available instruction datasets with synthetic training data to improve the performance of navigation agents. In this work, we propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that generates meaningful and well-formed synthetic instructions to improve navigation agents' performance. The model consists of a Transformer decoder (GPT-2) and a Transformer encoder (BERT). During training, the decoder generates sentences for a sequence of images describing the agent's path to a particular point, while the encoder discriminates between real and fake instructions. Experimentally, we evaluate the quality of the generated instructions and conduct extensive ablation studies. Furthermore, we use AIGeN to generate synthetic instructions for 217K trajectories on the Habitat-Matterport 3D Dataset (HM3D) and demonstrate an improvement in the performance of an off-the-shelf VLN method. The validation analysis of our proposal is conducted on REVERIE and R2R, highlighting its promising aspects and achieving state-of-the-art performance.

URL

https://arxiv.org/abs/2404.10054

PDF

https://arxiv.org/pdf/2404.10054.pdf
