Paper Reading AI Learner

WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

2025-06-18 16:09:18
Negar Foroutan, Angelika Romanou, Matin Ansaripour, Julian Martin Eisenschlos, Karl Aberer, Rémi Lebret

Abstract

Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.
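
Illustrative evaluation sketch

As a rough illustration of the MCQ evaluation protocol described above, the sketch below scores multiple-choice accuracy given gold answers and model-predicted option letters. The record fields ("answer", "prediction") and the toy data are assumptions for illustration, not the benchmark's actual schema or the authors' evaluation code; the direct-context and long-document-retrieval settings would differ only in how the predictions are produced.

# Hypothetical sketch: computing multiple-choice accuracy for a
# WikiMixQA-style evaluation. Field names and toy data are assumptions.
from typing import Iterable, Mapping


def mcq_accuracy(records: Iterable[Mapping[str, str]]) -> float:
    """Fraction of records whose predicted option letter matches the gold answer."""
    records = list(records)
    if not records:
        return 0.0
    correct = sum(
        1 for r in records
        if r["prediction"].strip().upper() == r["answer"].strip().upper()
    )
    return correct / len(records)


if __name__ == "__main__":
    # Toy records (made up): one correct prediction, one incorrect.
    demo = [
        {"question": "Which year shows the higher value in the chart?", "answer": "B", "prediction": "B"},
        {"question": "Which table row matches the chart's peak?", "answer": "C", "prediction": "A"},
    ]
    print(f"Accuracy: {mcq_accuracy(demo):.2f}")  # -> Accuracy: 0.50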

URL

https://arxiv.org/abs/2506.15594

PDF

https://arxiv.org/pdf/2506.15594.pdf

