Abstract
A key aspect of interpretable visual question answering (VQA) models is their ability to ground their answers in relevant regions of the image. Current approaches with this capability rely on supervised learning with human-annotated groundings to train the attention mechanisms inside the VQA architecture. Unfortunately, obtaining human annotations specifically for visual grounding is difficult and expensive. In this work, we demonstrate that a VQA architecture can be trained effectively with grounding supervision obtained automatically from available region descriptions and object annotations. We also show that a model trained with this mined supervision generates visual groundings that correlate more strongly with manually annotated groundings, while achieving state-of-the-art VQA accuracy.
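To make the idea of supervising attention concrete, here is a minimal sketch of a loss that pulls a model's attention over image regions toward an automatically mined grounding. The abstract does not state which loss is used, so the KL-divergence form is an assumption for illustration, and the names `normalize` and `attention_supervision_loss` as well as the example scores are hypothetical.

```python
import math


def normalize(scores):
    """Turn non-negative per-region scores into a probability distribution."""
    total = sum(scores)
    if total <= 0:
        # No mined evidence for this example: fall back to a uniform target.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def attention_supervision_loss(attention, mined_scores, eps=1e-8):
    """KL(target || attention) over image regions.

    Assumption: the KL form is chosen for illustration only; the source
    does not specify the loss.
    `attention` is the model's attention distribution over regions.
    `mined_scores` are grounding scores mined from annotations.
    """
    target = normalize(mined_scores)
    return sum(
        t * math.log((t + eps) / (a + eps))
        for t, a in zip(target, attention)
        if t > 0
    )


# Hypothetical example: four image regions; mined supervision marks regions 1 and 2.
mined = [0.0, 2.0, 2.0, 0.0]
aligned = [0.05, 0.45, 0.45, 0.05]      # attention focused on the marked regions
misaligned = [0.45, 0.05, 0.05, 0.45]   # attention focused elsewhere

loss_aligned = attention_supervision_loss(aligned, mined)
loss_misaligned = attention_supervision_loss(misaligned, mined)
```

In this sketch the loss is smaller when attention concentrates on the mined regions, so adding it to the usual answer-classification objective would push the attention toward the mined grounding.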
URL
https://arxiv.org/abs/1808.00265