Paper Reading AI Learner

Measuring Faithful and Plausible Visual Grounding in VQA

2023-05-24 10:58:02
Daniel Reich, Felix Putze, Tanja Schultz

Abstract

Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although the inference capabilities of VQA models are often illustrated with a few qualitative examples, most systems are not quantitatively assessed for their VG properties. We believe an easily calculated criterion for meaningfully measuring a system's VG can help remedy this shortcoming, as well as add another valuable dimension to model evaluation and analysis. To this end, we propose a new VG metric that captures whether a model a) identifies question-relevant objects in the scene, and b) actually relies on the information contained in the relevant objects when producing its answer, i.e., whether its visual grounding is both "faithful" and "plausible". Our metric, called "Faithful and Plausible Visual Grounding" (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA dataset is available on GitHub.
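
The two conditions above suggest a simple per-sample test that can be aggregated over a dataset. The following Python sketch illustrates one plausible realization: it assumes a VQA model that can be queried with a filtered set of object features, and ground-truth annotations of question-relevant objects (as available in GQA). The function names, the callable interface, and the exact pass/fail rule are illustrative assumptions for exposition, not the authors' reference implementation (see their GitHub release for that).

# Hypothetical sketch of an FPVG-style check; not the authors' reference code.
# Assumptions: the model is a callable mapping (question, object features) to
# an answer string, and question-relevant object indices are annotated
# (as in GQA ground-truth scene graphs).

from typing import Any, Callable, Sequence

VQAModel = Callable[[str, Sequence[Any]], str]  # (question, objects) -> answer

def is_fpvg_grounded(model: VQAModel, question: str, objects: Sequence[Any],
                     relevant_idx: set, gold_answer: str) -> bool:
    """True if the sample looks faithfully and plausibly grounded.

    Condition (a), "plausible": the model answers correctly when given only
    the question-relevant objects.
    Condition (b), "faithful": the answer breaks when the relevant objects
    are removed, i.e. the model actually relied on them.
    """
    relevant = [o for i, o in enumerate(objects) if i in relevant_idx]
    irrelevant = [o for i, o in enumerate(objects) if i not in relevant_idx]
    return (model(question, relevant) == gold_answer
            and model(question, irrelevant) != gold_answer)

def fpvg_score(samples: Sequence[dict], model: VQAModel) -> float:
    """Fraction of samples judged grounded by the per-sample check above."""
    hits = [is_fpvg_grounded(model, s["question"], s["objects"],
                             s["relevant_idx"], s["answer"])
            for s in samples]
    return sum(hits) / max(len(hits), 1)

Under this reading, a sample counts as grounded only if the relevant objects suffice to answer correctly (plausibility) and their removal changes the answer (faithfulness); the dataset-level metric is the fraction of such samples.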

URL

https://arxiv.org/abs/2305.15015

PDF

https://arxiv.org/pdf/2305.15015.pdf

