Abstract
Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.
URL
https://arxiv.org/abs/2506.15594