Abstract
This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Leveraging a balanced VQA dataset, we investigate three distinct strategies. First, GAN-based approaches aim to generate answer embeddings conditioned on image and question inputs, showing potential but struggling with more complex tasks. Second, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving results comparable to the GAN-based approach owing to better handling of complex questions. Lastly, attention mechanisms incorporating Multimodal Compact Bilinear pooling (MCB) address language priors and attention modeling, albeit with a complexity-performance trade-off. This study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attentional mechanisms.
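The MCB fusion mentioned above combines image and question features by approximating their outer product with Count Sketch projections and an FFT-domain elementwise product. The following is a minimal NumPy sketch under assumed, illustrative dimensions (2048-d image features, 300-d question features, 1024-d output); it is not the authors' implementation:

```python
import numpy as np

def count_sketch(v, h, s, d):
    # Project vector v into d dimensions: each input coordinate i
    # is added to output bucket h[i] with random sign s[i].
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

def mcb_pool(x, q, d=1024, seed=0):
    """Multimodal Compact Bilinear pooling (sketch): the convolution
    of two Count Sketches approximates the Count Sketch of the outer
    product of x and q; convolution is done via FFT."""
    rng = np.random.default_rng(seed)
    # Fixed random hash indices and signs for each modality
    hx = rng.integers(0, d, size=x.shape[0])
    sx = rng.choice([-1.0, 1.0], size=x.shape[0])
    hq = rng.integers(0, d, size=q.shape[0])
    sq = rng.choice([-1.0, 1.0], size=q.shape[0])
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fq = np.fft.rfft(count_sketch(q, hq, sq, d))
    # Elementwise product in the frequency domain = circular convolution
    return np.fft.irfft(fx * fq, n=d)

# Toy fused feature from random image/question embeddings
fused = mcb_pool(np.random.default_rng(1).normal(size=2048),
                 np.random.default_rng(2).normal(size=300))
```

In a VQA model, `fused` would feed a classifier over the answer vocabulary; the FFT trick keeps the fused dimension at `d` instead of the full bilinear `2048 × 300`.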
URL
https://arxiv.org/abs/2404.13565