Abstract
We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models' understanding of causality in the physical world. Existing VQA benchmarks tend to focus either on surface-level perceptual understanding of real-world videos or on narrow physical-reasoning questions created in simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios while focusing on models' ability to predict the likely outcomes of different actions and events, through five question types: counterfactual, hypothetical, anticipation, planning, and descriptive. We designed quality-control mechanisms that prevent models from exploiting trivial shortcuts, requiring them to base their answers on deep visual understanding rather than on linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights the difficulty current systems have in leveraging spatial-temporal reasoning, an understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.
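To make the evaluation setup concrete, here is a minimal Python sketch of how a benchmark item and per-question-type scoring might look. The abstract does not specify CausalVQA's actual data schema or evaluation protocol, so the field names, the multiple-choice format, and the `predict` callable below are all illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record layout; the paper's actual schema may differ.
@dataclass
class CausalVQAItem:
    video_id: str
    question_type: str   # one of: counterfactual, hypothetical,
                         # anticipation, planning, descriptive
    question: str
    choices: list[str]   # assumed multiple-choice format
    answer: str          # gold answer

def per_type_accuracy(items, predict):
    """Score a model's predictions, broken down by question type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.question_type] += 1
        if predict(item) == item.answer:
            correct[item.question_type] += 1
    return {t: correct[t] / total[t] for t in total}

if __name__ == "__main__":
    items = [
        CausalVQAItem("vid_001", "anticipation",
                      "What happens next?", ["A", "B", "C", "D"], "B"),
    ]
    # A trivial baseline that always picks the first choice.
    print(per_type_accuracy(items, lambda item: item.choices[0]))
```

Under this kind of breakdown, the gap the authors report would appear as markedly lower scores in the anticipation and hypothetical buckets relative to descriptive questions.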
URL
https://arxiv.org/abs/2506.09943