Boter: Bootstrapping Knowledge Selection and Question Answering for Knowledge-based VQA

2024-04-22 07:44:20
Dongze Hao, Qunbo Wang, Longteng Guo, Jie Jiang, Jing Liu

Abstract

Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to answer questions about visual content. Previous methods mostly follow the "retrieve and generate" paradigm: they first use a pre-trained retriever to fetch relevant knowledge documents, then use those documents to generate answers. While these methods perform well on the task, they have two limitations: (1) they rely on an independent retriever that acquires knowledge solely from the similarity between query and knowledge embeddings, without assessing whether a knowledge document actually helps answer the question; (2) they convert the image into text and then retrieve and answer in natural-language space, which may fail to capture all of the image's information. To address these limitations, we propose Boter, a novel framework that bootstraps knowledge selection and question answering by leveraging the strong multimodal perception capabilities of a Multimodal Large Language Model (MLLM). The framework consists of two modules, a Selector and an Answerer, both initialized from the MLLM and parameter-efficiently finetuned in a simple cycle: use the Selector to find key knowledge in the retrieved knowledge documents, then finetune the Answerer on that knowledge to predict answers; derive pseudo-labels for key knowledge documents from the Answerer's predictions and the weak supervision labels, then finetune the Selector to select key knowledge; repeat. Our framework significantly improves on the baseline on the challenging open-domain knowledge-based VQA benchmark OK-VQA, achieving a state-of-the-art accuracy of 62.83%.
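The alternating Selector/Answerer cycle described above is compact, so a minimal sketch of its control flow may help. Everything below is an assumption for illustration only: the `Selector`/`Answerer` objects, their `score`, `generate`, and `finetune` methods, and the exact-match pseudo-labeling heuristic are hypothetical stand-ins, not the paper's actual implementation.

```python
# A minimal sketch of Boter's bootstrap loop (hypothetical API, not the paper's code).
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    image: str            # path or identifier for the image
    question: str
    answers: List[str]    # weak supervision: annotated answer strings
    documents: List[str]  # knowledge documents from the pre-trained retriever

def select_top_k(selector, ex: Example, k: int = 5) -> List[str]:
    """Score each retrieved document with the Selector (MLLM-initialized)
    and keep the k documents judged most useful for this question."""
    scores = [selector.score(ex.image, ex.question, doc) for doc in ex.documents]
    ranked = sorted(zip(scores, ex.documents), reverse=True)
    return [doc for _, doc in ranked[:k]]

def pseudo_label(answerer, ex: Example) -> List[int]:
    """Mark a document as key knowledge if conditioning the Answerer on it
    yields a prediction matching the weak supervision labels."""
    labels = []
    for doc in ex.documents:
        pred = answerer.generate(ex.image, ex.question, [doc])
        labels.append(int(pred in ex.answers))
    return labels

def bootstrap(selector, answerer, data: List[Example], rounds: int = 3):
    for _ in range(rounds):
        # 1) Selector picks key documents; Answerer is finetuned to answer with them.
        answerer.finetune([(ex, select_top_k(selector, ex)) for ex in data])
        # 2) Answerer predictions + weak labels give document pseudo-labels;
        #    Selector is finetuned to recognize key knowledge.
        selector.finetune([(ex, pseudo_label(answerer, ex)) for ex in data])
```

The point of the sketch is the coupling: the Answerer's training data depends on the Selector's current choices, and the Selector's training signal depends on whether the Answerer can actually answer correctly from a given document, so each round bootstraps the other module.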


URL

https://arxiv.org/abs/2404.13947

PDF

https://arxiv.org/pdf/2404.13947.pdf

