Paper Reading AI Learner

SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

2025-06-18 15:43:26
Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao

Abstract

We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.

Abstract (translated)

我们介绍SciVer,这是首个专门设计用于评估基础模型在多模态科学背景下验证声明能力的基准测试。SciVer包含3000个由专家标注的例子,这些例子源自1,113篇科学论文,并覆盖了四种不同的子集,每个子集代表多模态科学研究中常见的推理类型。为了进行细致的评估,每一个例子都包含了由专家标注的支持证据。我们对21种最先进的多模态基础模型进行了性能评估,其中包括o4-mini、Gemini-2.5-Flash、Llama-3.2-Vision和Qwen2.5-VL。我们的实验结果显示,这些模型在SciVer上的表现与人类专家之间存在显著差距。通过深入分析增强检索生成(Retrieval-Augmented Generation, RAG)以及人工错误评估,我们识别出了当前开源模型中的关键限制,并为提升模型在多模态科学文献任务中的理解和推理能力提供了重要见解。

URL

https://arxiv.org/abs/2506.15569

PDF

https://arxiv.org/pdf/2506.15569.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot