
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models

2025-06-11 17:10:36
Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, Justine T. Kao

Abstract

We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models' understanding of causality in the physical world. Existing VQA benchmarks tend to focus either on surface-level perceptual understanding of real-world videos or on narrow physical reasoning questions created in simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios while focusing on models' ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning, and descriptive. We designed quality-control mechanisms that prevent models from exploiting trivial shortcuts, requiring them to base their answers on deep visual understanding rather than on linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights the difficulty current systems have in leveraging spatial-temporal reasoning, an understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.
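The abstract describes the dataset only at a high level, but it implies a simple record structure: a real-world video clip paired with a question, one of five question types, and an answer. Below is a minimal Python sketch of such a record; the class name, field names, and the multiple-choice format are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical sketch of a CausalVQA-style record, inferred from the
# abstract. Field names and the multiple-choice format are assumptions.
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    # The five question types named in the abstract.
    COUNTERFACTUAL = "counterfactual"
    HYPOTHETICAL = "hypothetical"
    ANTICIPATION = "anticipation"
    PLANNING = "planning"
    DESCRIPTIVE = "descriptive"


@dataclass
class CausalVQAExample:
    video_path: str              # path to the source real-world video clip
    question: str                # natural-language question about the clip
    question_type: QuestionType  # one of the five categories above
    choices: list[str]           # candidate answers (assumed multiple-choice)
    answer_index: int            # index of the correct choice
```

Carrying the question type explicitly on each record makes per-type accuracy easy to report, which matters here because the abstract locates the largest model-human gaps on the anticipation and hypothetical categories.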

Abstract (translated)

We introduce CausalVQA, a benchmark dataset for video question answering (VQA) containing question-answer pairs that probe a model's understanding of causal relationships in the physical world. Existing VQA benchmarks either focus on surface-level perceptual understanding of real videos or on narrow physical reasoning questions built in simulation environments. CausalVQA fills this important gap by posing challenging questions grounded in realistic scenarios across five question types (counterfactual, hypothetical, anticipation, planning, and descriptive), focusing on a model's ability to predict the likely outcomes of different actions and events. We designed quality-control mechanisms to prevent models from exploiting trivial shortcuts, requiring answers to rest on deep visual understanding rather than linguistic cues. We find that current frontier multimodal models perform substantially below human level on this benchmark, especially on anticipation and hypothetical questions. This indicates that current systems struggle to leverage spatiotemporal reasoning, an understanding of physical principles, and comprehension of possible alternatives to make accurate predictions, particularly in real-world settings.
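The quality-control claim above (answers must rest on visual understanding rather than linguistic cues) is commonly audited with a blind baseline: run a text-only model over the questions and flag items it answers correctly without ever seeing a frame. The sketch below illustrates that generic audit pattern using the hypothetical CausalVQAExample record from the earlier snippet; it is not necessarily the mechanism the authors used.

```python
# Generic blind-baseline audit for linguistic shortcuts. `text_only_model`
# is a hypothetical callable mapping (question, choices) -> predicted
# answer index from text alone, with no access to the video.
def blind_baseline_audit(examples, text_only_model):
    """Return the questions a text-only model answers correctly, plus
    its overall accuracy. Blind accuracy far above chance suggests the
    flagged items leak their answers through linguistic cues and are
    candidates for revision or removal during quality control."""
    answered_blind = [
        ex for ex in examples
        if text_only_model(ex.question, ex.choices) == ex.answer_index
    ]
    accuracy = len(answered_blind) / len(examples) if examples else 0.0
    return answered_blind, accuracy
```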

URL

https://arxiv.org/abs/2506.09943

PDF

https://arxiv.org/pdf/2506.09943.pdf

