Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition while involving relatively simple reasoning, which has led to saturation and leaves them unable to evaluate advanced multimodal cognitive skills effectively. To address this gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will be of interest to the community and help push the boundary of cutting-edge AI for broader science.
URL
https://arxiv.org/abs/2510.08559