Paper Reading AI Learner

Recurrent Video Masked Autoencoders

2025-12-15 18:59:48
Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, Andrew Zisserman

Abstract

We present Recurrent Video Masked Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
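The abstract describes the training loop's overall shape: encode only the visible (unmasked) patches of each frame, fold the resulting features into a recurrent state, and decode pixel predictions for the masked patches, so that cost grows linearly with video length. A minimal toy sketch of that structure, using plain numpy with tiny linear stand-ins for the paper's encoder, recurrent update, and decoder (all module shapes, dimensions, and the 75% mask ratio here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16           # feature/patch dimension (toy value, not from the paper)
P = 8            # patches per frame (toy value)
MASK_RATIO = 0.75
N_VIS = max(1, int(P * (1 - MASK_RATIO)))  # visible patches per frame

# Hypothetical stand-ins for the paper's modules: a per-frame patch
# encoder, a recurrent state-update, and a pixel decoder.
W_enc = rng.normal(0, 0.1, (D, D))       # "image encoder" (single linear map here)
W_rec = rng.normal(0, 0.1, (2 * D, D))   # recurrent update over [state, pooled feats]
W_dec = rng.normal(0, 0.1, (D, D))       # pixel decoder

def encode(visible_patches):
    # Dense features from VISIBLE patches only (the asymmetric masking).
    return np.tanh(visible_patches @ W_enc)

def recurrent_update(state, feats):
    # Aggregate this frame's features into a fixed-size running state:
    # each frame is touched once, so total cost is linear in video length.
    pooled = feats.mean(axis=0)
    return np.tanh(np.concatenate([state, pooled]) @ W_rec)

def decode(state, n_patches):
    # Predict pixels for every patch position from the recurrent state.
    return np.tile(state @ W_dec, (n_patches, 1))

def reconstruction_loss(video):
    # One pass over the video: pixel MSE on masked patches only.
    state = np.zeros(D)
    loss = 0.0
    for frame in video:                       # frame: (P, D) patch pixels
        idx = rng.permutation(P)
        vis_idx, mask_idx = idx[:N_VIS], idx[N_VIS:]
        feats = encode(frame[vis_idx])
        state = recurrent_update(state, feats)
        pred = decode(state, P)
        loss += np.mean((pred[mask_idx] - frame[mask_idx]) ** 2)
    return loss / len(video)

video = rng.normal(size=(4, P, D))            # 4 frames of 8 toy patches
loss = reconstruction_loss(video)
print(float(loss))
```

In a real implementation the linear maps would be a vision transformer encoder, a transformer-based recurrent cell, and a lightweight decoder, trained by backpropagating this masked-pixel loss; the point of the sketch is only the control flow that gives linear-time aggregation over frames.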

URL

https://arxiv.org/abs/2512.13684

PDF

https://arxiv.org/pdf/2512.13684.pdf

