Abstract
Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks remains unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, reveal that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen as the number of images increases. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
URL
https://arxiv.org/abs/2404.18532