Paper Reading AI Learner

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

2024-04-25 07:29:17
An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang

Abstract

Set-of-Mark (SoM) prompting unleashes the visual grounding capability of GPT-4V by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance of GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image in the alphanumeric order of the tags. By integrating our curated dataset with other visual instruction tuning datasets, we equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even at a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing the trained models to understand the working mechanism of SoM. Our code and data are available at \url{this https URL}.
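To make the "list items one by one" idea concrete, below is a minimal sketch (not the authors' released code) of how a SoM-style training sample might be constructed: numeric tags are drawn at object locations on the image, and the target text enumerates the tagged objects in ascending tag order. The object boxes and names are hypothetical inputs, e.g. from an off-the-shelf detector or existing annotations.

```python
# Hypothetical sketch of SoM-style data construction for "list items one by one".
# Not the paper's implementation; boxes/names are assumed to come from existing
# annotations or a detector.
from PIL import Image, ImageDraw

def add_som_tags(image_path, boxes):
    """Overlay numeric tags (1, 2, ...) at the center of each bounding box."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        # Small white square as the tag background, with the index drawn on top.
        draw.rectangle([cx - 10, cy - 10, cx + 10, cy + 10], fill="white")
        draw.text((cx - 6, cy - 8), str(idx), fill="black")
    return img

def build_listing_sample(object_names):
    """Pair the tagged image with an instruction asking the model to
    enumerate every tagged item in tag order."""
    prompt = "Please list each tagged item in the image one by one, in tag order."
    answer = "\n".join(f"{i}. {name}" for i, name in enumerate(object_names, start=1))
    return {"prompt": prompt, "answer": answer}

# Example usage with hypothetical annotations:
# boxes = [(30, 40, 120, 200), (150, 60, 260, 220)]
# names = ["a black cat", "a wooden chair"]
# tagged_image = add_som_tags("example.jpg", boxes)
# sample = build_listing_sample(names)
```

The key design choice the paper describes is that the supervision target walks through the tags in order, which ties each alphanumeric mark to a textual description and thereby strengthens object-text alignment during training.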

Abstract (translated)

Set-of-Mark (SoM) prompting reveals the visual grounding capability of GPT-4V by enabling the model to associate tags inserted on the image with visual objects. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Although GPT-4V performs remarkably well, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate all visual tags on the image in the alphanumeric order of the tags. By integrating our curated dataset with other visual instruction tuning datasets, we equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that even at a relatively small scale (10k-30k tagged images), this new dataset significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from the input images. This indicates the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens object-text alignment through the use of visual tags in the training stage. Finally, we probe the trained models to understand the working mechanism of SoM. Our code and data are available at: this https URL.

URL

https://arxiv.org/abs/2404.16375

PDF

https://arxiv.org/pdf/2404.16375.pdf

