Abstract
Large Language Models have demonstrated remarkable performance across a variety of tasks, including the capacity to rapidly acquire new skills through In-Context Learning (ICL) from only a few demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study reveals several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is no better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at this https URL.
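To make the majority-voting baseline concrete: the idea is that the prediction for a query is simply the most frequent label among the in-context demonstrations, ignoring the query itself. The sketch below is a hypothetical illustration of this baseline, not the paper's actual implementation; the function name and example labels are invented for clarity.

```python
from collections import Counter

def majority_vote(demo_labels):
    """Predict the most frequent label among in-context demonstration labels.

    A minimal baseline: it ignores the query entirely and returns whichever
    label appears most often among the retrieved demonstrations.
    """
    if not demo_labels:
        raise ValueError("need at least one demonstration label")
    return Counter(demo_labels).most_common(1)[0][0]

# Hypothetical example: labels of six retrieved demonstrations for one query
labels = ["cat", "dog", "cat", "cat", "bird", "dog"]
print(majority_vote(labels))  # -> cat
```

If a retrieval strategy like RICES selects demonstrations similar to the query, their labels tend to cluster, which is why such a label-frequency baseline can match more elaborate M-ICL behavior.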
URL
https://arxiv.org/abs/2404.15736