Abstract
Benefiting from advances in computer vision, natural language processing, and information retrieval, visual question answering (VQA), which aims to answer questions about an image or a video, has received considerable attention over the past few years. Although some progress has been achieved, several studies have pointed out that current VQA models are heavily affected by the \emph{language prior problem}: they tend to answer questions based on the co-occurrence patterns of question keywords (e.g., how many) and answers (e.g., 2) instead of understanding the images and questions. Existing methods attempt to solve this problem either by balancing the biased datasets or by forcing models to better understand the images. However, only marginal improvements are observed for the first solution, and even performance deterioration for the second. Another important issue is the lack of a metric to quantitatively measure the extent of the language prior effect, which severely hinders the advancement of related techniques. In this paper, we contribute to solving the above problems from two perspectives. First, we design a metric to quantitatively measure the language prior effect of VQA models; our empirical studies demonstrate its effectiveness. Second, we propose a regularization method (i.e., a score regularization module) that enhances current VQA models by alleviating the language prior problem while boosting the backbone model's performance. The proposed score regularization module adopts a pair-wise learning strategy, which makes VQA models answer a question by reasoning over the image (with respect to that question) instead of relying on question-answer patterns observed in the biased training set. The score regularization module can be flexibly integrated into various VQA models.
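The abstract does not give the exact formulation of the pair-wise learning strategy, but the idea can be illustrated with a generic pairwise margin (ranking) loss: the answer score the model produces for the true (image, question) pair should exceed, by some margin, the score it produces when the image is mismatched, which penalizes models that answer from question-answer priors alone. The function below is a hypothetical sketch of this idea, not the paper's actual module.

```python
def pairwise_margin_loss(score_matched, score_mismatched, margin=0.5):
    """Hinge-style pairwise loss (illustrative sketch, not the paper's
    exact formulation): penalize the model when the score for the true
    (image, question) pair does not beat the score obtained with a
    mismatched image by at least `margin`."""
    return max(0.0, margin - (score_matched - score_mismatched))

# A model that genuinely uses the image scores the matched pair much
# higher, so the loss is zero.
print(pairwise_margin_loss(0.9, 0.2))   # 0.0
# A model that ignores the image gives nearly equal scores, so the loss
# is positive and pushes the model to rely on visual reasoning.
print(pairwise_margin_loss(0.55, 0.50))
```

In practice such a loss would be added to the backbone VQA model's training objective, which is consistent with the claim that the module can be flexibly integrated into various models.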
URL
https://arxiv.org/abs/1905.04877