Paper Reading AI Learner

Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering

2023-01-25 19:29:19
Chenxi Whitehouse, Tillman Weyde, Pranava Madhyastha

Abstract

Providing explanations for visual question answering (VQA) has gained much attention in research. However, most existing systems use separate models for predicting answers and providing explanations. We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance. To address this, we propose a multitask learning approach towards a Unified Model for more grounded and consistent generation of both Answers and Explanations (UMAE). To achieve this, we add artificial prompt tokens to training instances and finetune a multimodal encoder-decoder model on various VQA tasks. In our experiments, UMAE models surpass the prior SOTA answer accuracy on A-OKVQA by 10~15%, show competitive results on OK-VQA, achieve new SOTA explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
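The abstract's core mechanism, prepending artificial prompt tokens so one encoder-decoder model learns both answer and explanation generation, can be sketched as follows. This is a minimal illustration, not the authors' code; the token names (`<answer>`, `<explain>`) and the `build_instance` helper are assumptions for the sketch.

```python
# Hedged sketch of UMAE-style multitask instance construction: an artificial
# prompt token is prepended to each input so a single multimodal
# encoder-decoder model can be finetuned to produce either an answer or an
# explanation for the same question.

def build_instance(task: str, question: str, target: str) -> dict:
    """Prefix an artificial prompt token marking the desired output type.

    The specific token strings below are illustrative assumptions; the
    paper's actual tokens may differ.
    """
    prompts = {"answer": "<answer>", "explain": "<explain>"}
    return {"input": f"{prompts[task]} {question}", "target": target}

# One question yields two training instances, one per task, so the
# explanation is learned jointly with (and grounded in) the QA objective.
batch = [
    build_instance("answer", "What sport is shown?", "tennis"),
    build_instance("explain", "What sport is shown?",
                   "The player holds a racket on a court with a net."),
]
for ex in batch:
    print(ex["input"], "->", ex["target"])
```

In this setup the prompt token, rather than a separate model, selects the output type, which is what lets the two tasks share all encoder-decoder parameters.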

URL

https://arxiv.org/abs/2301.10799

PDF

https://arxiv.org/pdf/2301.10799.pdf
