Abstract
We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples drawn from 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiments reveal a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
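To make the verification task concrete, here is a minimal sketch of scoring a model on a SciVer-style example. The example schema (`claim`, `context`, `label`), the two-way label set, and the `query_model` placeholder are assumptions for illustration only, not the benchmark's released format or evaluation code; in practice the context would also include the paper's figures and tables passed to a multimodal model.

```python
# Minimal sketch of claim verification on a SciVer-style example.
# The field names and label set below are assumptions for illustration;
# consult the released benchmark for the actual schema and metrics.

from typing import Callable

LABELS = {"entailed", "refuted"}  # assumed label set


def build_prompt(claim: str, context: str) -> str:
    """Format a claim-verification prompt from the paper's textual context."""
    return (
        "You are verifying a scientific claim against the provided context.\n"
        f"Context:\n{context}\n\n"
        f"Claim: {claim}\n"
        "Answer with exactly one word: entailed or refuted."
    )


def verify_claim(example: dict, query_model: Callable[[str], str]) -> bool:
    """Return True if the model's predicted label matches the gold label."""
    prompt = build_prompt(example["claim"], example["context"])
    prediction = query_model(prompt).strip().lower()
    if prediction not in LABELS:
        prediction = "refuted"  # conservative fallback for unparseable output
    return prediction == example["label"]


if __name__ == "__main__":
    # Toy example with a stubbed model; replace the stub with a real
    # multimodal API call (image/table inputs are omitted in this sketch).
    example = {
        "claim": "Model A outperforms Model B on the reported benchmark.",
        "context": "Table 2 reports 71.2 for Model A and 68.5 for Model B.",
        "label": "entailed",
    }
    stub_model = lambda prompt: "entailed"
    print("correct:", verify_claim(example, stub_model))
```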
URL
https://arxiv.org/abs/2506.15569