Paper Reading AI Learner

Quantifying and Alleviating the Language Prior Problem in Visual Question Answering

2019-05-13 06:31:33
Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Yibing Liu, Yinglong Wang, Mohan Kankanhalli

Abstract

Benefiting from advances in computer vision, natural language processing, and information retrieval, visual question answering (VQA), which aims to answer questions about an image or a video, has received a great deal of attention over the past few years. Although some progress has been achieved, several studies have pointed out that current VQA models are heavily affected by the *language prior problem*: they tend to answer questions based on co-occurrence patterns between question keywords (e.g., "how many") and answers (e.g., "2") rather than by understanding the image and the question. Existing methods attempt to solve this problem either by balancing the biased datasets or by forcing models to better understand the images; however, the first yields only marginal gains and the second can even degrade performance. Another important issue is the lack of a metric to quantitatively measure the extent of the language prior effect, which severely hinders the advancement of related techniques. In this paper, we address these problems from two perspectives. First, we design a metric that quantitatively measures the language prior effect of VQA models; our empirical studies demonstrate its effectiveness. Second, we propose a regularization method (a score regularization module) that alleviates the language prior problem while also boosting the backbone model's performance. The module adopts a pair-wise learning strategy that encourages VQA models to answer a question by reasoning over the image rather than by relying on question-answer patterns observed in the biased training set, and it can be flexibly integrated into various VQA models.
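The abstract describes the score regularization module only at a high level (a pair-wise learning strategy that favors image-grounded answers over prior-driven ones). As a minimal sketch of one plausible pairwise, margin-based formulation — all function names, the margin value, and the use of a question-only score are assumptions, not the paper's actual method:

```python
def pairwise_prior_loss(score_with_image, score_question_only, margin=0.5):
    """Hypothetical hinge-style pairwise term: the score the model assigns
    to the correct answer when it sees the image should exceed the score
    it assigns from the question text alone by at least `margin`.
    Zero loss once the grounded score wins by the full margin."""
    return max(0.0, margin - (score_with_image - score_question_only))

def batch_prior_loss(score_pairs, margin=0.5):
    """Average the pairwise term over a batch of
    (grounded score, question-only score) pairs."""
    return sum(pairwise_prior_loss(si, sq, margin) for si, sq in score_pairs) / len(score_pairs)
```

Under this sketch, the regularizer would be added to the backbone VQA loss, penalizing cases where the question alone already scores the answer nearly as high as the question plus the image — exactly the symptom of the language prior problem.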

URL

https://arxiv.org/abs/1905.04877

PDF

https://arxiv.org/pdf/1905.04877.pdf

