
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

2024-04-01 04:28:01
Rongjie Li, Yu Wu, Xuming He

Abstract

Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks such as image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large language model (LLM)-generated annotations, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct data samples for the ICCC task from image-text datasets at low labeling and computation cost. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements on zero-shot image-text generation-based VL tasks through ICCC instruction tuning.
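
To make the data-construction idea concrete, below is a minimal sketch of how ICCC-style training samples might be built from an image-text dataset, assuming spaCy as the lightweight dependency parser. The instruction wording, the make_iccc_sample helper, and the POS-keyed concept pool are illustrative assumptions based on the abstract, not the authors' released pipeline.

```python
# Minimal sketch of ICCC-style sample construction (an assumption based on
# the abstract, not the authors' code). A content word in a ground-truth
# caption is swapped with another word of the same coarse category drawn
# from the dataset, yielding a mismatched caption the VLM must correct
# conditioned on the image.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # lightweight parser; model must be downloaded first

def make_iccc_sample(caption, concept_pool):
    """Corrupt one concept in `caption`; return (instruction, input, target).

    `concept_pool` maps a coarse POS tag (e.g. "NOUN", "VERB") to words
    collected from other captions in the image-text dataset.
    """
    doc = nlp(caption)
    candidates = [t for t in doc if t.pos_ in concept_pool]
    if not candidates:
        return None  # nothing corruptible in this caption
    victim = random.choice(candidates)
    substitutes = [w for w in concept_pool[victim.pos_] if w != victim.text]
    if not substitutes:
        return None
    corrupted = caption.replace(victim.text, random.choice(substitutes), 1)
    # The VLM is tuned to regenerate the faithful caption from the image and
    # the mismatched one (the instruction string here is illustrative only).
    return ("Correct this caption so it matches the image:", corrupted, caption)

# Example usage with a toy concept pool:
pool = {"NOUN": ["dog", "bicycle", "pizza"], "VERB": ["rides", "eats", "holds"]}
print(make_iccc_sample("A man rides a bicycle down the street.", pool))
```

Because the corruption is derived purely from the captions' own linguistic structure, no human or LLM annotation is needed, which is the source of the low labeling cost the abstract claims.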

URL

https://arxiv.org/abs/2404.00909

PDF

https://arxiv.org/pdf/2404.00909.pdf

