Abstract
Many motion-centric video analysis tasks, such as recognizing atomic actions, detecting atypical motor behavior in individuals with autism, or analyzing articulatory motion in real-time MRI of human speech, require efficient and interpretable temporal modeling. Capturing temporal dynamics is a central challenge in video analysis, often demanding significant computational resources and fine-grained annotations that are not widely available. This paper presents MOOSE (Motion Flow Over Spatial Space), a novel temporally-centric video encoder that, inspired by human motion perception, explicitly integrates optical flow with spatial embeddings to model temporal information efficiently. Unlike prior models, MOOSE takes advantage of rich, widely available pre-trained visual and optical flow encoders instead of training video models from scratch, significantly reducing computational complexity while enhancing temporal interpretability. Our primary contributions include: (1) proposing a computationally efficient, temporally-centric architecture for video understanding; (2) demonstrating enhanced interpretability in modeling temporal dynamics; and (3) achieving state-of-the-art performance on diverse benchmarks, including clinical, medical, and standard action recognition datasets, confirming the broad applicability and effectiveness of our approach.
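To make the abstract's architectural idea concrete, below is a minimal PyTorch sketch of how per-frame spatial embeddings from a frozen pre-trained image encoder might be fused with features from a frozen optical-flow encoder. This is not the paper's actual MOOSE architecture: the cross-attention fusion, the GRU temporal aggregator, the dimensions, and the class name `MooseStyleFusion` are all illustrative assumptions; only the general recipe (frozen pre-trained spatial and flow encoders plus a small trainable fusion module) follows the abstract.

```python
import torch
import torch.nn as nn

class MooseStyleFusion(nn.Module):
    """Hypothetical fusion head: motion (flow) tokens attend over spatial tokens.

    Both upstream encoders are assumed frozen and pre-trained; only this small
    module would be trained, which is where the abstract's claimed savings over
    training a video model from scratch would come from.
    """

    def __init__(self, spatial_dim=768, flow_dim=256, d_model=256, num_classes=400):
        super().__init__()
        self.spatial_proj = nn.Linear(spatial_dim, d_model)  # project frame embeddings
        self.flow_proj = nn.Linear(flow_dim, d_model)        # project flow embeddings
        # Cross-attention: temporal (flow) tokens query the spatial tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.temporal = nn.GRU(d_model, d_model, batch_first=True)  # aggregate over time
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, spatial_tokens, flow_tokens):
        # spatial_tokens: (B, T, spatial_dim), e.g. per-frame CLS tokens of a frozen ViT
        # flow_tokens:    (B, T-1, flow_dim), e.g. pooled features of a frozen flow encoder
        q = self.flow_proj(flow_tokens)         # motion queries
        kv = self.spatial_proj(spatial_tokens)  # spatial keys/values
        fused, _ = self.cross_attn(q, kv, kv)   # "motion flow over spatial space"
        _, h = self.temporal(fused)             # summarize the motion sequence
        return self.head(h[-1])                 # clip-level prediction

# Toy usage with stand-in tensors in place of real encoder outputs.
B, T = 2, 16
logits = MooseStyleFusion()(torch.randn(B, T, 768), torch.randn(B, T - 1, 256))
print(logits.shape)  # torch.Size([2, 400])
```

Because the flow tokens act as queries, the fused representation is indexed by motion rather than appearance, which is one plausible way to realize the temporal interpretability the abstract emphasizes.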
URL
https://arxiv.org/abs/2506.01119