MileBench: Benchmarking MLLMs in Long Context

2024-04-29 09:19:05
Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

Abstract

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on a specific task (e.g., time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 20 models, reveal that while the closed-source GPT-4(Vision) and Gemini 1.5 outperform others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen as the number of images increases. We strongly encourage intensified research efforts toward enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
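
To make the evaluation setup concrete, here is a minimal Python sketch of scoring a model on interleaved multi-image, long-text samples. The MultiImageSample schema, the model_fn signature, and exact-match scoring are illustrative assumptions, not MileBench's actual data format or metrics (the benchmark spans multiple comprehension and generation tasks).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MultiImageSample:
    """One long-context, multi-image evaluation sample.
    Hypothetical schema for illustration, not the MileBench format."""
    segments: List[str]      # interleaved text pieces of the long context
    image_paths: List[str]   # images referenced between the text segments
    question: str
    answer: str

def exact_match_accuracy(
    samples: List[MultiImageSample],
    model_fn: Callable[[List[str], List[str], str], str],
) -> float:
    """Case-insensitive exact-match accuracy, the simplest metric a
    benchmark like this might use for short-answer tasks."""
    correct = 0
    for s in samples:
        prediction = model_fn(s.segments, s.image_paths, s.question)
        correct += int(prediction.strip().lower() == s.answer.strip().lower())
    return correct / max(len(samples), 1)

if __name__ == "__main__":
    sample = MultiImageSample(
        segments=["Frame 1 shows a cat.", "Frame 2 shows the cat jumping."],
        image_paths=["frame1.jpg", "frame2.jpg"],
        question="What animal appears in the frames?",
        answer="cat",
    )
    dummy_model = lambda segments, images, question: "cat"  # stand-in model
    print(exact_match_accuracy([sample], dummy_model))  # 1.0
```

A real harness would grow the number of images per sample, as the paper does, to probe how accuracy degrades with context length.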

URL

https://arxiv.org/abs/2404.18532

PDF

https://arxiv.org/pdf/2404.18532.pdf

