Paper Reading AI Learner

Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions

2018-01-27 05:34:37
Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, Jiebo Luo

Abstract

Visual Question Answering (VQA) has attracted attention from both computer vision and natural language processing communities. Most existing approaches adopt the pipeline of representing an image via pre-trained CNNs, and then using the uninterpretable CNN features in conjunction with the question to predict the answer. Although such end-to-end models might report promising performance, they rarely provide any insight, apart from the answer, into the VQA process. In this work, we propose to break up the end-to-end VQA into two steps: explaining and reasoning, in an attempt towards a more explainable VQA by shedding light on the intermediate results between these two steps. To that end, we first extract attributes and generate descriptions as explanations for an image using pre-trained attribute detectors and image captioning models, respectively. Next, a reasoning module utilizes these explanations in place of the image to infer an answer to the question. The advantages of such a breakdown include: (1) the attributes and captions can reflect what the system extracts from the image, thus can provide some explanations for the predicted answer; (2) these intermediate results can help us identify the inabilities of both the image understanding part and the answer inference part when the predicted answer is wrong. We conduct extensive experiments on a popular VQA dataset and dissect all results according to several measurements of the explanation quality. Our system achieves comparable performance with the state-of-the-art, yet with added benefits of explainability and the inherent ability to further improve with higher quality explanations.
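The two-step pipeline the abstract describes (explain first, then reason over the explanations instead of the image) can be sketched as follows. This is a minimal illustrative sketch only: the function names, the toy attribute list, and the keyword-matching "reasoner" are assumptions for clarity, standing in for the paper's pre-trained attribute detectors, captioning models, and learned reasoning module.

```python
# Hedged sketch of the "explain, then reason" VQA breakdown from the
# abstract. All names and the toy logic are illustrative assumptions,
# not the paper's actual models.

def explain(image):
    """Step 1: turn the image into human-readable explanations.
    Stands in for the pre-trained attribute detector and captioner."""
    # Toy stand-in: a real system would run a CNN-based attribute
    # detector and an image captioning model on `image` here.
    return {
        "attributes": ["dog", "frisbee", "grass", "running"],
        "caption": "a dog runs across the grass chasing a frisbee",
    }

def reason(explanations, question):
    """Step 2: infer an answer from the explanations alone --
    the image itself is never consulted at this stage."""
    q = question.lower()
    for attr in explanations["attributes"]:
        if attr in q:
            return "yes"  # toy matching; the paper uses a learned module
    return "no"

def answer(image, question):
    explanations = explain(image)  # interpretable intermediate result
    prediction = reason(explanations, question)
    # Returning the explanations alongside the answer is the point:
    # when the prediction is wrong, one can inspect whether the image
    # understanding step or the reasoning step failed.
    return prediction, explanations

pred, expl = answer(image=None, question="Is there a dog in the picture?")
print(pred)  # "yes", since "dog" appears in the toy attributes
```

Because the reasoning module sees only the attributes and the caption, inspecting that intermediate output shows exactly what the system "believed" about the image when it produced its answer.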

URL

https://arxiv.org/abs/1801.09041

PDF

https://arxiv.org/pdf/1801.09041.pdf

