Boter: Bootstrapping Knowledge Selection and Question Answering for Knowledge-based VQA

2024-04-22 07:44:20
Dongze Hao, Qunbo Wang, Longteng Guo, Jie Jiang, Jing Liu

Abstract

Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to answer questions about visual content. Previous methods mostly follow the "retrieve and generate" paradigm: they first use a pre-trained retriever to fetch relevant knowledge documents, then use those documents to generate answers. While these methods perform well on the task, they have two limitations: (1) they rely on an independent retriever that acquires knowledge solely from the similarity between query and knowledge embeddings, without assessing whether a knowledge document actually helps answer the question; (2) they convert the image into text and then retrieve and answer in natural-language space, which may fail to capture all of the image's information. To address these limitations, we propose Boter, a novel framework that bootstraps knowledge selection and question answering by leveraging the strong multimodal perception capabilities of a Multimodal Large Language Model (MLLM). The framework consists of two modules, a Selector and an Answerer, both initialized from the MLLM and parameter-efficiently finetuned in a simple cycle: use the Selector to find key knowledge in the retrieved knowledge documents, then finetune the Answerer on that knowledge to predict answers; derive pseudo-labels for key knowledge documents from the Answerer's predictions and the weak supervision labels, then finetune the Selector to select key knowledge; repeat. Our framework significantly improves on the baseline on the challenging open-domain knowledge-based VQA benchmark OK-VQA, achieving a state-of-the-art accuracy of 62.83%.
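The alternating Selector/Answerer cycle described above is compact, so a minimal sketch of its control flow may help. Everything below is an assumption for illustration only: the `Selector`/`Answerer` objects, their `score`, `generate`, and `finetune` methods, and the exact-match pseudo-labeling heuristic are hypothetical stand-ins, not the paper's actual implementation.

```python
# A minimal sketch of Boter's bootstrap loop (hypothetical API, not the paper's code).
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    image: str            # path or identifier for the image
    question: str
    answers: List[str]    # weak supervision: annotated answer strings
    documents: List[str]  # knowledge documents from the pre-trained retriever

def select_top_k(selector, ex: Example, k: int = 5) -> List[str]:
    """Score each retrieved document with the Selector (MLLM-initialized)
    and keep the k documents judged most useful for this question."""
    scores = [selector.score(ex.image, ex.question, doc) for doc in ex.documents]
    ranked = sorted(zip(scores, ex.documents), reverse=True)
    return [doc for _, doc in ranked[:k]]

def pseudo_label(answerer, ex: Example) -> List[int]:
    """Mark a document as key knowledge if conditioning the Answerer on it
    yields a prediction matching the weak supervision labels."""
    labels = []
    for doc in ex.documents:
        pred = answerer.generate(ex.image, ex.question, [doc])
        labels.append(int(pred in ex.answers))
    return labels

def bootstrap(selector, answerer, data: List[Example], rounds: int = 3):
    for _ in range(rounds):
        # 1) Selector picks key documents; Answerer is finetuned to answer with them.
        answerer.finetune([(ex, select_top_k(selector, ex)) for ex in data])
        # 2) Answerer predictions + weak labels give document pseudo-labels;
        #    Selector is finetuned to recognize key knowledge.
        selector.finetune([(ex, pseudo_label(answerer, ex)) for ex in data])
```

The point of the sketch is the coupling: the Answerer's training data depends on the Selector's current choices, and the Selector's training signal depends on whether the Answerer can actually answer correctly from a given document, so each round bootstraps the other module.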


URL

https://arxiv.org/abs/2404.13947

PDF

https://arxiv.org/pdf/2404.13947.pdf

