Abstract
Generative vision-language models (VLMs) have shown impressive performance on zero-shot vision-language tasks such as image captioning and visual question answering. However, improving their zero-shot reasoning typically requires a second stage of instruction tuning, which relies heavily on human-labeled or large-language-model-generated annotations, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby strengthening instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct training samples for the ICCC task from image-text datasets at low labeling and computation cost. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements on zero-shot image-text generation-based VL tasks through ICCC instruction tuning.
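The data-construction idea described above can be illustrated with a minimal sketch: corrupt one concept in a ground-truth caption and train the model to restore the original, image-consistent caption. This is an assumption-laden toy version; the paper uses a lightweight dependency parser to identify the concepts to replace, whereas here a hand-written concept list and the helper `make_iccc_sample` (both hypothetical, not from the paper) stand in for the parsed structure.

```python
import random

def make_iccc_sample(caption, concepts, vocab, rng):
    """Build one ICCC-style training pair: a corrupted caption as input,
    the original caption as the correction target.

    `concepts` stands in for words a dependency parser would extract;
    `vocab` supplies distractor concepts of the same category.
    """
    target = rng.choice(concepts)  # concept to corrupt (parser-extracted in the paper)
    distractor = rng.choice([w for w in vocab if w != target])
    corrupted = caption.replace(target, distractor, 1)  # inject a visual-language mismatch
    instruction = f"Correct the caption to match the image: {corrupted}"
    # The VLM is conditioned on the image and must regenerate the true caption.
    return {"input": instruction, "output": caption}

rng = random.Random(0)
caption = "a dog chasing a ball on the grass"
sample = make_iccc_sample(
    caption,
    concepts=["dog", "ball", "grass"],
    vocab=["cat", "frisbee", "beach"],
    rng=rng,
)
```

Because the corrupted caption is derived mechanically from existing image-text pairs, no human or LLM annotation is needed, which is the source of the low labeling cost claimed in the abstract.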
URL
https://arxiv.org/abs/2404.00909