Abstract
Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
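The abstract only names the idea of "temporal statistics (higher-order moments) over features from pre-trained semantic models", so below is a minimal sketch of what such a descriptor could look like. The function name, feature shapes, choice of encoder, and the specific moments (mean, variance, skewness, kurtosis) are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def semantic_moments(frame_features: np.ndarray, num_moments: int = 4) -> np.ndarray:
    """Temporal moment statistics over per-frame semantic features.

    frame_features: (T, D) array of features from a pre-trained semantic
    encoder (e.g. an image backbone applied frame by frame) -- assumed shape.
    Returns a (num_moments * D,) training-free video descriptor.
    """
    mean = frame_features.mean(axis=0)
    centered = frame_features - mean
    std = centered.std(axis=0) + 1e-8  # avoid division by zero for static channels

    stats = [mean]
    if num_moments >= 2:
        stats.append(centered.var(axis=0))             # 2nd moment: variance
    if num_moments >= 3:
        stats.append(((centered / std) ** 3).mean(0))  # 3rd moment: skewness
    if num_moments >= 4:
        stats.append(((centered / std) ** 4).mean(0))  # 4th moment: kurtosis
    return np.concatenate(stats)
```

Under this sketch, a query video and each gallery video would each be reduced to one descriptor, and retrieval would rank gallery videos by a standard similarity (e.g. cosine) between descriptors.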
URL
https://arxiv.org/abs/2602.09146