
MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows

2025-06-01 18:53:27
Hong Nguyen, Dung Tran, Hieu Hoang, Phong Nguyen, Shrikanth Narayanan

Abstract

Many motion-centric video analysis tasks, such as recognizing atomic actions, detecting atypical motor behavior in individuals with autism, or analyzing articulatory motion in real-time MRI of human speech, require efficient and interpretable temporal modeling. Capturing temporal dynamics is a central challenge in video analysis, often requiring significant computational resources and fine-grained annotations that are not widely available. This paper presents MOOSE (Motion Flow Over Spatial Space), a novel temporally-centric video encoder that explicitly integrates optical flow with spatial embeddings to model temporal information efficiently, inspired by human perception of motion. Unlike prior models, MOOSE takes advantage of rich, widely available pre-trained visual and optical flow encoders instead of training video models from scratch. This significantly reduces computational complexity while enhancing temporal interpretability. Our primary contributions include: (1) proposing a computationally efficient, temporally-centric architecture for video understanding; (2) demonstrating enhanced interpretability in modeling temporal dynamics; and (3) achieving state-of-the-art performance on diverse benchmarks, including clinical, medical, and standard action recognition datasets, confirming the broad applicability and effectiveness of our approach.
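To make the abstract's fusion idea concrete, below is a minimal PyTorch sketch of the kind of architecture it describes: token features from a frozen pre-trained spatial encoder and a frozen optical-flow encoder are projected into a shared space and fused via cross-attention. The class name `MooseLikeEncoder`, all dimensions, and the exact fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the MOOSE idea as stated in the abstract: fuse a frozen
# pre-trained spatial encoder with a frozen optical-flow encoder instead of
# training a video model from scratch. Shapes and modules are assumptions.
import torch
import torch.nn as nn

class MooseLikeEncoder(nn.Module):
    def __init__(self, spatial_dim=768, flow_dim=256, model_dim=512,
                 num_heads=8, num_classes=400):
        super().__init__()
        # Project both streams into a shared embedding space.
        self.spatial_proj = nn.Linear(spatial_dim, model_dim)
        self.flow_proj = nn.Linear(flow_dim, model_dim)
        # Flow tokens attend over spatial tokens ("motion flow over spatial space").
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(model_dim)
        self.head = nn.Linear(model_dim, num_classes)

    def forward(self, spatial_tokens, flow_tokens):
        # spatial_tokens: (B, Ns, spatial_dim) from a frozen image encoder
        # flow_tokens:    (B, Nf, flow_dim) from a frozen optical-flow encoder
        s = self.spatial_proj(spatial_tokens)
        f = self.flow_proj(flow_tokens)
        # Temporal (flow) queries gather spatial context via cross-attention.
        fused, _ = self.cross_attn(query=f, key=s, value=s)
        fused = self.norm(fused + f)          # residual connection
        return self.head(fused.mean(dim=1))   # pool over tokens, then classify

# Usage with random features standing in for frozen encoder outputs.
if __name__ == "__main__":
    B, Ns, Nf = 2, 196, 49
    model = MooseLikeEncoder()
    logits = model(torch.randn(B, Ns, 768), torch.randn(B, Nf, 256))
    print(logits.shape)  # torch.Size([2, 400])
```

Because both encoders stay frozen, only the projections, attention, and head are trained, which is consistent with the abstract's claim of reduced computational complexity; the attention weights over flow tokens also offer one plausible route to the temporal interpretability the paper emphasizes.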


URL

https://arxiv.org/abs/2506.01119

PDF

https://arxiv.org/pdf/2506.01119.pdf

