Paper Reading AI Learner

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

2024-04-25 17:39:35
Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

Abstract

Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios, characterized by extensive text embedded within images, are ubiquitous in the real world. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from them. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed to evaluate the text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of real-world text-rich scenarios. These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation of 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus) and highlight the current limitations of MLLMs in text-rich visual comprehension. We hope that our work serves as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research on text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at this https URL.
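Since the benchmark consists of multiple-choice questions scored per category, its evaluation protocol can be sketched as a simple accuracy computation. The field names (`category`, `answer`, `prediction`) and the sample records below are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical sketch of scoring a model on SEED-Bench-2-Plus-style
# multiple-choice questions. The dict keys and example records are
# assumptions for illustration, not the released data format.
from collections import defaultdict

samples = [
    {"category": "Charts", "answer": "A", "prediction": "A"},
    {"category": "Charts", "answer": "B", "prediction": "C"},
    {"category": "Maps",   "answer": "D", "prediction": "D"},
    {"category": "Webs",   "answer": "C", "prediction": "C"},
]

def accuracy_by_category(samples):
    """Return (per-category accuracy, overall accuracy) for MCQ predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s["category"]] += 1
        # Compare the predicted choice letter against the ground-truth letter.
        if s["prediction"].strip().upper() == s["answer"].strip().upper():
            correct[s["category"]] += 1
    per_cat = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_cat, overall

per_cat, overall = accuracy_by_category(samples)
print(per_cat)   # {'Charts': 0.5, 'Maps': 1.0, 'Webs': 1.0}
print(overall)   # 0.75
```

Reporting accuracy per category as well as overall matters here because, as the abstract notes, Charts, Maps, and Webs stress different text-rich skills, and a single aggregate score would hide category-specific weaknesses.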

URL

https://arxiv.org/abs/2404.16790

PDF

https://arxiv.org/pdf/2404.16790.pdf
