Abstract
Generative vision-language models (VLMs) have shown impressive performance on zero-shot vision-language tasks such as image captioning and visual question answering. However, improving their zero-shot reasoning typically requires a second stage of instruction tuning, which relies heavily on human-labeled or large-language-model-generated annotations, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby strengthening instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct training samples for the ICCC task from image-text datasets at low labeling and computation cost. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements on zero-shot image-text generation-based VL tasks through ICCC instruction tuning.
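The data-construction idea described above can be illustrated with a minimal sketch: corrupt one concept in a ground-truth caption and train the model to restore the original, image-consistent caption. This is an assumption-laden toy version; the paper uses a lightweight dependency parser to identify the concepts to replace, whereas here a hand-written concept list and the helper `make_iccc_sample` (both hypothetical, not from the paper) stand in for the parsed structure.

```python
import random

def make_iccc_sample(caption, concepts, vocab, rng):
    """Build one ICCC-style training pair: a corrupted caption as input,
    the original caption as the correction target.

    `concepts` stands in for words a dependency parser would extract;
    `vocab` supplies distractor concepts of the same category.
    """
    target = rng.choice(concepts)  # concept to corrupt (parser-extracted in the paper)
    distractor = rng.choice([w for w in vocab if w != target])
    corrupted = caption.replace(target, distractor, 1)  # inject a visual-language mismatch
    instruction = f"Correct the caption to match the image: {corrupted}"
    # The VLM is conditioned on the image and must regenerate the true caption.
    return {"input": instruction, "output": caption}

rng = random.Random(0)
caption = "a dog chasing a ball on the grass"
sample = make_iccc_sample(
    caption,
    concepts=["dog", "ball", "grass"],
    vocab=["cat", "frisbee", "beach"],
    rng=rng,
)
```

Because the corrupted caption is derived mechanically from existing image-text pairs, no human or LLM annotation is needed, which is the source of the low labeling cost claimed in the abstract.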
URL
https://arxiv.org/abs/2404.00909