Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts

Abstract

Visual question answering (VQA) is considered an AI-complete task because it requires understanding, reasoning, and inference over both visual and language content. Over the past few years, numerous neural architectures have been proposed for the VQA problem. However, zero-shot VQA remains challenging because it demands advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediate step within the VQA pipeline. Specifically, we examine the efficacy of using image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across question types that differ in structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at \url{this https URL}.
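
The three steps named in the abstract (keyword extraction, keyword-guided captioning, prompt construction) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword-based keyword extractor, the instruction and prompt wording, and every function name here are hypothetical, and the captioning model is passed in as a plain callable rather than tied to any particular library.

```python
import re
from typing import Any, Callable, List

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "do", "does", "did",
    "what", "which", "who", "where", "when", "how", "why",
    "of", "in", "on", "at", "to", "and", "or", "there", "this", "that",
    "it", "any", "some", "with", "by", "for", "from", "be", "has", "have",
}


def extract_keywords(question: str) -> List[str]:
    """Return the non-stopword tokens of the question, in order, without repeats."""
    tokens = re.findall(r"[a-z]+", question.lower())
    keywords: List[str] = []
    for token in tokens:
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def build_caption_instruction(keywords: List[str]) -> str:
    """Turn keywords into an instruction for the captioner (hypothetical wording)."""
    if not keywords:
        return "Describe the image."
    return "Describe the image, focusing on: " + ", ".join(keywords) + "."


def build_qa_prompt(caption: str, question: str) -> str:
    """Combine the caption and the question into a text-only prompt for an LLM."""
    return (
        "Answer the question using only the image description.\n"
        f"Description: {caption}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def question_driven_prompt(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],
) -> str:
    """Run the pipeline: keywords -> question-driven caption -> LLM prompt.

    `captioner` stands in for any image captioning model that accepts an
    image and a text instruction and returns a caption string.
    """
    keywords = extract_keywords(question)
    caption = captioner(image, build_caption_instruction(keywords))
    return build_qa_prompt(caption, question)
```

With this sketch, `extract_keywords("What color is the car on the left?")` returns `["color", "car", "left"]`, and the resulting prompt would then be sent to an LLM of choice. Passing the captioner as a callable keeps the example independent of any specific model, which also reflects the abstract's point that different captioning models can be swapped in and compared.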

URL

https://arxiv.org/abs/2404.08589

PDF

https://arxiv.org/pdf/2404.08589.pdf
