Paper Reading AI Learner

Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering

2023-01-25 19:29:19
Chenxi Whitehouse, Tillman Weyde, Pranava Madhyastha

Abstract

Providing explanations for visual question answering (VQA) has gained much attention in research. However, most existing systems use separate models for predicting answers and providing explanations. We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance. To address this, we propose a multitask learning approach towards a Unified Model for more grounded and consistent generation of both Answers and Explanations (UMAE). To achieve this, we add artificial prompt tokens to training instances and finetune a multimodal encoder-decoder model on various VQA tasks. In our experiments, UMAE models surpass the prior SOTA answer accuracy on A-OKVQA by 10~15%, show competitive results on OK-VQA, achieve new SOTA explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
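The abstract's core mechanism, prepending artificial prompt tokens so one encoder-decoder model learns both answer and explanation generation, can be sketched as follows. This is a minimal illustration, not the authors' code; the token names (`<answer>`, `<explain>`) and the `build_instance` helper are assumptions for the sketch.

```python
# Hedged sketch of UMAE-style multitask instance construction: an artificial
# prompt token is prepended to each input so a single multimodal
# encoder-decoder model can be finetuned to produce either an answer or an
# explanation for the same question.

def build_instance(task: str, question: str, target: str) -> dict:
    """Prefix an artificial prompt token marking the desired output type.

    The specific token strings below are illustrative assumptions; the
    paper's actual tokens may differ.
    """
    prompts = {"answer": "<answer>", "explain": "<explain>"}
    return {"input": f"{prompts[task]} {question}", "target": target}

# One question yields two training instances, one per task, so the
# explanation is learned jointly with (and grounded in) the QA objective.
batch = [
    build_instance("answer", "What sport is shown?", "tennis"),
    build_instance("explain", "What sport is shown?",
                   "The player holds a racket on a court with a net."),
]
for ex in batch:
    print(ex["input"], "->", ex["target"])
```

In this setup the prompt token, rather than a separate model, selects the output type, which is what lets the two tasks share all encoder-decoder parameters.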

URL

https://arxiv.org/abs/2301.10799

PDF

https://arxiv.org/pdf/2301.10799.pdf
