Abstract
Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest as over-reliance on irrelevant image parts or a disregard for the visual modality entirely. Although the inference capabilities of VQA models are often illustrated with a few qualitative examples, most systems are not quantitatively assessed for their VG properties. We believe an easily calculated criterion for meaningfully measuring a system's VG can help remedy this shortcoming and add another valuable dimension to model evaluation and analysis. To this end, we propose a new VG metric that captures whether a model (a) identifies question-relevant objects in the scene, and (b) actually relies on the information contained in those objects when producing its answer, i.e., whether its visual grounding is both "faithful" and "plausible". Our metric, called "Faithful and Plausible Visual Grounding" (FPVG), is straightforward to determine for most VQA model designs. We give a detailed description of FPVG and evaluate several reference systems spanning various VQA architectures. Code to support the metric calculations on the GQA dataset is available on GitHub.
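One plausible way to operationalize the two conditions above is to evaluate each question under different visual inputs: the model's original answer, its answer when only question-relevant objects are visible, and its answer when only irrelevant objects are visible. The sketch below illustrates this idea; it is an assumption-laden reading of the abstract, not the paper's exact definition, and all function names and the three-condition protocol are hypothetical.

```python
def fpvg_label(ans_all, ans_relevant, ans_irrelevant):
    """Illustrative per-question grounding check (assumed protocol).

    A model counts as faithfully and plausibly grounded on a question
    if restricting its visual input to question-relevant objects
    preserves its answer, while restricting it to irrelevant objects
    changes the answer.
    """
    return (ans_relevant == ans_all) and (ans_irrelevant != ans_all)


def fpvg_score(records):
    """Fraction of questions on which the model is well grounded.

    `records` is an iterable of (answer_all, answer_relevant,
    answer_irrelevant) triples, one per question.
    """
    labels = [fpvg_label(a, r, i) for a, r, i in records]
    return sum(labels) / len(labels)


# Toy example with hypothetical model answers for three questions.
records = [
    ("red", "red", "blue"),  # grounded: relevant objects suffice, irrelevant ones flip the answer
    ("dog", "cat", "dog"),   # not grounded: removing irrelevant objects changes the answer
    ("yes", "yes", "yes"),   # not grounded: answer is insensitive to the visual input
]
print(fpvg_score(records))  # → 0.333...
```

The key design point is that condition (a) alone ("plausible") is not enough: a model could attend to the right objects yet ignore them, which is why the irrelevant-objects condition probes "faithfulness" as well.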
URL
https://arxiv.org/abs/2305.15015