Paper Reading AI Learner

SemanticMoments: Training-Free Motion Similarity via Third Moment Features

2026-02-09 19:47:56
Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady

Abstract

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
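The page carries no code, but the core idea — pooling per-frame features from a pre-trained semantic encoder into temporal moment statistics — is simple to sketch. Below is a minimal, hedged illustration: `semantic_moments` is a hypothetical helper (not the authors' released code), and the random `(T, D)` arrays stand in for real frame features from a pre-trained model.

```python
import numpy as np

def semantic_moments(frame_features: np.ndarray) -> np.ndarray:
    """Temporal moment descriptor over per-frame semantic features.

    frame_features: (T, D) array, one feature vector per frame,
    assumed to come from a pre-trained semantic encoder.
    Returns a (3*D,) descriptor: per-dimension temporal mean,
    second central moment (variance), and third central moment.
    """
    mu = frame_features.mean(axis=0)
    centered = frame_features - mu
    m2 = (centered ** 2).mean(axis=0)   # temporal variance per dimension
    m3 = (centered ** 3).mean(axis=0)   # third central moment per dimension
    return np.concatenate([mu, m2, m3])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two descriptors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy example: two synthetic "videos" with the same appearance (same mean
# feature) but different motion dynamics (different temporal variation).
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 8))
still = np.repeat(base, 16, axis=0) + 0.01 * rng.normal(size=(16, 8))
moving = np.repeat(base, 16, axis=0) + np.sin(np.linspace(0, 6, 16))[:, None]

d_still = semantic_moments(still)
d_moving = semantic_moments(moving)
```

Because the two toy clips share a mean (appearance) but differ in temporal variation (motion), their moment descriptors separate them even though a mean-pooled representation would not — which is the intuition the abstract attributes to higher-order temporal moments.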

Abstract (translated)

Retrieving videos based on semantic motion is a fundamental but still unsolved problem. Existing video representation methods rely too heavily on static appearance and scene context while neglecting motion dynamics, a bias that stems from their training data and objectives. By contrast, traditional motion-centric inputs (such as optical flow) lack the semantic context needed to understand high-level motion. To expose this inherent bias, we introduce the SimMotion benchmarks, which combine controlled synthetic data with a new human-annotated real-world dataset. We find that existing models perform poorly on these benchmarks, often failing to separate motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (in particular, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, optical-flow, and text-supervised methods. This shows that temporal statistics in a semantic feature space provide a scalable and perceptually well-grounded framework for motion-centric video understanding.

URL

https://arxiv.org/abs/2602.09146

PDF

https://arxiv.org/pdf/2602.09146.pdf

