Question Relevance in Visual Question Answering

Abstract

Free-form and open-ended Visual Question Answering (VQA) systems aim to provide an accurate natural language answer to a question about an image. Current VQA systems do not check whether the posed question is relevant to the input image, and hence produce nonsensical answers when given a question that is irrelevant to the image. In this paper, we address the problem of identifying the relevance of a posed question to an image. We split the problem into two sub-problems. We first identify whether the question is visual. If the question is visual, we then determine whether it is relevant to the image. For the second sub-problem, we generate a large dataset from existing visual question answering datasets in order to enable the training of complex architectures and to model the relevance of a visual question to an image. We also compare the results of our Long Short-Term Memory (LSTM) recurrent neural network based models to Logistic Regression, XGBoost, and multi-layer perceptron based approaches to the problem.
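
The two-stage decision described in the abstract can be pictured as a simple cascade. The sketch below is only an illustration of that control flow, not the paper's implementation: the names `RelevanceCascade`, `toy_is_visual`, and `toy_is_relevant` are hypothetical, the keyword-based scorers are toy stand-ins for trained models, and the image is represented as a set of object labels purely as a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Set


def tokenize(question: str) -> Set[str]:
    """Lowercase the question, drop a trailing '?', and split on whitespace."""
    return set(question.lower().rstrip("?").split())


# ASSUMPTION: toy cue words standing in for a trained visual/non-visual classifier.
VISUAL_CUES = {"color", "wearing", "holding", "picture", "image"}


def toy_is_visual(question: str) -> float:
    """Return 1.0 if the question contains a visual cue word, else 0.0."""
    return 1.0 if tokenize(question) & VISUAL_CUES else 0.0


def toy_is_relevant(question: str, image_labels: Set[str]) -> float:
    """Return 1.0 if the question mentions any object label of the image, else 0.0.

    ASSUMPTION: the image is summarized as a set of object labels; a real
    model would consume learned image features instead.
    """
    return 1.0 if tokenize(question) & image_labels else 0.0


@dataclass
class RelevanceCascade:
    """Hypothetical two-stage filter: visual check first, then relevance check."""

    is_visual: Callable[[str], float]
    is_relevant: Callable[[str, Set[str]], float]
    threshold: float = 0.5

    def classify(self, question: str, image_labels: Set[str]) -> str:
        # Stage 1: a non-visual question is rejected without looking at the image.
        if self.is_visual(question) < self.threshold:
            return "non-visual"
        # Stage 2: only visual questions are compared against the image.
        if self.is_relevant(question, image_labels) < self.threshold:
            return "visual, irrelevant"
        return "visual, relevant"


cascade = RelevanceCascade(toy_is_visual, toy_is_relevant)
image = {"dog", "frisbee"}
for q in ("Who is the president of France?",
          "What color is the cat?",
          "What color is the dog?"):
    print(q, "->", cascade.classify(q, image))
```

Either stage could be swapped for a learned scorer (for example an LSTM-based model or one of the baselines named above) as long as it returns a score that can be compared against the threshold.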

URL

https://arxiv.org/abs/1807.08435

PDF

https://arxiv.org/pdf/1807.08435.pdf