Abstract
Large Language Models have demonstrated remarkable performance across a variety of tasks, including the capacity to rapidly acquire new skills through In-Context Learning (ICL) from only a few demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study reveals several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is no better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at this https URL.
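To make the majority-voting baseline concrete: the idea is that the prediction for a query is simply the most frequent label among the in-context demonstrations, ignoring the query itself. The sketch below is a hypothetical illustration of this baseline, not the paper's actual implementation; the function name and example labels are invented for clarity.

```python
from collections import Counter

def majority_vote(demo_labels):
    """Predict the most frequent label among in-context demonstration labels.

    A minimal baseline: it ignores the query entirely and returns whichever
    label appears most often among the retrieved demonstrations.
    """
    if not demo_labels:
        raise ValueError("need at least one demonstration label")
    return Counter(demo_labels).most_common(1)[0][0]

# Hypothetical example: labels of six retrieved demonstrations for one query
labels = ["cat", "dog", "cat", "cat", "bird", "dog"]
print(majority_vote(labels))  # -> cat
```

If a retrieval strategy like RICES selects demonstrations similar to the query, their labels tend to cluster, which is why such a label-frequency baseline can match more elaborate M-ICL behavior.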
URL
https://arxiv.org/abs/2404.15736