Abstract
Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.
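The objective described above can be summarized with a small, hedged sketch. Everything in the snippet below is an illustrative assumption inferred from the abstract rather than the authors' released code: the module name FrameEncoder, the GRU-based temporal fusion, the smooth-L1 and cosine losses, the loss weighting, and all tensor shapes are hypothetical. The only parts taken from the abstract are the two training signals: a frozen DINO model supplies patch-feature targets for the present and a future frame, and a frozen CLIP image encoder supplies the class-token alignment target.

```python
# Minimal training-objective sketch based only on the abstract's description.
# All names, shapes, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Hypothetical per-frame video encoder producing patch features and a class token."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # ViT-style patchify
        self.temporal = nn.GRU(dim, dim, batch_first=True)               # causal fusion of past + present frames
        self.future_head = nn.Linear(dim, dim)                           # predicts next-frame patch features
        self.cls_head = nn.Linear(dim, dim)                              # class token for CLIP alignment

    def forward(self, frames):                        # frames: (B, T, 3, H, W), present frame last
        B, T = frames.shape[:2]
        x = self.embed(frames.flatten(0, 1))          # (B*T, D, h, w)
        x = x.flatten(2).transpose(1, 2)              # (B*T, N, D) patch tokens
        N, D = x.shape[1:]
        x = x.view(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        x, _ = self.temporal(x)                       # per-patch recurrence over time
        cur = x[:, -1].view(B, N, D)                  # present-frame patch prediction
        fut = self.future_head(cur)                   # future-frame patch prediction
        cls = self.cls_head(cur.mean(dim=1))          # pooled class token
        return cur, fut, cls

def frame_loss(model, frames, next_frame, dino, clip_vis):
    """Two signals from the abstract: regress frozen DINO patch features for the
    present and next frame, and align the class token with CLIP's image embedding."""
    cur_pred, fut_pred, cls_pred = model(frames)
    with torch.no_grad():                             # frozen image-model teachers
        cur_tgt = dino(frames[:, -1])                 # (B, N, D) DINO patch features, present frame
        fut_tgt = dino(next_frame)                    # (B, N, D) DINO patch features, future frame
        cls_tgt = clip_vis(frames[:, -1])             # (B, D) CLIP image embedding
    loss_patch = F.smooth_l1_loss(cur_pred, cur_tgt) + F.smooth_l1_loss(fut_pred, fut_tgt)
    loss_cls = 1.0 - F.cosine_similarity(cls_pred, cls_tgt, dim=-1).mean()
    return loss_patch + loss_cls
```

In this sketch, `dino` and `clip_vis` stand for frozen teacher callables returning patch features and a global image embedding, respectively; the GRU is only a placeholder for whatever causal temporal mechanism the paper actually uses, the point being that each frame's prediction depends only on past and present inputs.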
URL
https://arxiv.org/abs/2506.05543