Paper Reading AI Learner

FRAME: Pre-Training Video Feature Representations via Anticipation and Memory

2025-06-05 19:44:47
Sethuraman TV, Savya Khosla, Vignesh Srinivasakumar, Jiahui Huang, Seoung Wug Oh, Simon Jenni, Derek Hoiem, Joon-Young Lee

Abstract

Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.
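To make the stated objective concrete, below is a minimal sketch of what a training step like the one the abstract describes might look like: predicting frozen-DINO patch features for the current and a future frame from past and present RGB, plus an auxiliary CLIP alignment of the class token. All names, shapes, and loss choices here are assumptions for illustration, not the authors' implementation.

    # Hypothetical sketch of the objective described in the abstract.
    # encoder, dino, and clip_img are placeholder callables, not FRAME's actual code.
    import torch
    import torch.nn.functional as F

    def frame_training_step(encoder, dino, clip_img, frames, future_frames):
        """One illustrative step.
        frames:        (B, T, 3, H, W) past and present RGB frames
        future_frames: (B, K, 3, H, W) future RGB frames (targets only)
        """
        # Student: per-frame patch predictions for the present, an anticipated
        # prediction for the future, and a class token (assumed output structure).
        pred_patches, pred_future, cls_token = encoder(frames)

        with torch.no_grad():
            # Frozen DINO supplies dense patch-feature targets.
            tgt_now = dino(frames[:, -1])            # (B, N, D) current-frame patches
            tgt_future = dino(future_frames[:, 0])   # (B, N, D) future-frame patches
            # Frozen CLIP image embedding anchors the class token semantically.
            tgt_cls = clip_img(frames[:, -1])        # (B, D_clip)

        # Dense losses: current-frame prediction plus future anticipation.
        loss_now = F.mse_loss(pred_patches, tgt_now)
        loss_future = F.mse_loss(pred_future, tgt_future)
        # Auxiliary alignment of the class token with CLIP's semantic space.
        loss_cls = 1 - F.cosine_similarity(cls_token, tgt_cls, dim=-1).mean()

        return loss_now + loss_future + loss_cls

The key design point this sketch tries to capture is that both teachers (DINO for dense targets, CLIP for the class token) stay frozen, so the video encoder inherits their spatial precision and semantics while adding temporal coherence through the anticipation term.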


URL

https://arxiv.org/abs/2506.05543

PDF

https://arxiv.org/pdf/2506.05543.pdf

