Abstract
Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios, characterized by extensive text embedded within images, are ubiquitous in the real world. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from them. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating the \textbf{text-rich visual comprehension} of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of real-world text-rich scenarios. Due to their inherent complexity and diversity, these categories effectively simulate real-world text-rich environments. We further conduct a thorough evaluation of 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus) and highlight the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at this https URL.
URL
https://arxiv.org/abs/2404.16790