Abstract
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance of GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image in the alphanumeric order of the tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even at a relatively small size (10k-30k tagged images), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens the object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at \url{this https URL}.
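To make the "list items one by one" target concrete, the sketch below builds the kind of enumeration response the paradigm asks a model to produce: every SoM tag on the image listed in ascending alphanumeric order with a description of the tagged object. The function name, the use of a dict as the description source, and the example objects are illustrative assumptions, not the paper's actual data pipeline.

```python
# Hedged sketch of a "list items one by one" training target. Given the
# alphanumeric SoM tags overlaid on an image, the model should enumerate
# every tag in ascending order and describe the object under each one.

def build_listing_target(tag_to_description):
    """Enumerate all visual tags in ascending alphanumeric order,
    emitting one "tag: description" line per tag."""
    lines = []
    for tag in sorted(tag_to_description):  # ascending alphanumeric order
        lines.append(f"{tag}: {tag_to_description[tag]}")
    return "\n".join(lines)

# Hypothetical example: three tagged objects in one image.
tags = {"2": "a wooden chair", "1": "a black laptop", "3": "a coffee mug"}
print(build_listing_target(tags))
# 1: a black laptop
# 2: a wooden chair
# 3: a coffee mug
```

Pairing each tagged image with such an exhaustive, ordered listing is what forces the model to attend to every tag, which is the alignment signal the paper credits for the downstream gains.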
URL
https://arxiv.org/abs/2404.16375