Abstract
In the last few years, the research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task that involves an agent following human instructions and navigating in a previously unknown environment to reach a specified goal. Recent work in the literature focuses on different ways to augment the available datasets of instructions, exploiting synthetic training data to improve navigation performance. In this work, we propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance. The model is composed of a Transformer decoder (GPT-2) and a Transformer encoder (BERT). During the training phase, the decoder generates sentences for a sequence of images describing the agent's path to a particular point, while the encoder discriminates between real and fake instructions. Experimentally, we evaluate the quality of the generated instructions and perform extensive ablation studies. Additionally, we generate synthetic instructions for 217K trajectories using AIGeN on the Habitat-Matterport 3D Dataset (HM3D) and show an improvement in the performance of an off-the-shelf VLN method. We validate our proposal on REVERIE and R2R; the analysis highlights its promising aspects, achieving state-of-the-art performance.
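The abstract describes an adversarial setup in which a Transformer decoder generates an instruction conditioned on image features of the agent's path, while a Transformer encoder scores instructions as real or fake. A minimal sketch of such a training step is below, with tiny PyTorch Transformer modules standing in for the pretrained GPT-2 and BERT models the paper names; all sizes, the soft-token (softmax) trick for differentiability, and the loss formulation are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions (assumptions, not from the paper)
VOCAB, DIM, SEQ, IMGS, BATCH = 100, 32, 12, 4, 2

class Generator(nn.Module):
    """Transformer decoder (GPT-2 stand-in): emits instruction token logits
    conditioned on a sequence of image features describing the path."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerDecoderLayer(DIM, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, img_feats):
        h = self.decoder(self.embed(tokens), img_feats)
        return self.head(h)  # (BATCH, SEQ, VOCAB) logits

class Discriminator(nn.Module):
    """Transformer encoder (BERT stand-in): outputs a realness logit
    for an instruction given as per-token distributions."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.cls = nn.Linear(DIM, 1)

    def forward(self, token_dists):
        h = self.encoder(self.proj(token_dists))
        return self.cls(h.mean(dim=1))  # (BATCH, 1)

G, D = Generator(), Discriminator()
img_feats = torch.randn(BATCH, IMGS, DIM)  # visual features of the trajectory
real = torch.nn.functional.one_hot(
    torch.randint(0, VOCAB, (BATCH, SEQ)), VOCAB).float()
# Soft tokens keep the generator differentiable through the discriminator
fake = G(torch.randint(0, VOCAB, (BATCH, SEQ)), img_feats).softmax(-1)

bce = nn.BCEWithLogitsLoss()
# Discriminator learns to separate real human instructions from generated ones
d_loss = bce(D(real), torch.ones(BATCH, 1)) + bce(D(fake.detach()),
                                                  torch.zeros(BATCH, 1))
# Generator is rewarded for fooling the discriminator
g_loss = bce(D(fake), torch.ones(BATCH, 1))
```

In an actual system the two losses would drive alternating optimizer steps, and the trained generator would then label unannotated trajectories (as the paper does for 217K HM3D paths) to augment VLN training data.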
URL
https://arxiv.org/abs/2404.10054