
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

2023-01-26 14:12:02
Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, Thomas H. Li

Abstract

Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, and have thus attracted increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transfer, the key point in extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level, semantic-dominant tasks (e.g., retrieval) or low-level, visual-pattern-dominant tasks (e.g., recognition), and fail to handle both cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both the high-level and low-level knowledge in the CLIP model. To tackle this problem, we present the Spatial-Temporal Auxiliary Network (STAN), a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transfer, STAN adopts a branch structure with decomposed spatial-temporal modules that allow multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Code will be available at this https URL
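The abstract only sketches STAN's design: an auxiliary branch alongside CLIP whose decomposed spatial-temporal blocks consume features from multiple CLIP layers. The PyTorch sketch below illustrates one plausible reading of that structure; the module names, the use of self-attention along both axes, the residual fusion of per-level features, and the final pooling are all assumptions made for illustration, not the paper's implementation.

# A minimal, illustrative sketch of a STAN-style auxiliary branch, based only
# on the abstract's description. All names, dimensions, and the fusion
# strategy are assumptions, not the authors' code.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """One decomposed block: spatial self-attention over patches within each
    frame, then temporal self-attention over frames at each patch position."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- batch, frames, patch tokens, channels
        B, T, N, D = x.shape
        # Spatial attention within each frame.
        s = self.norm1(x.reshape(B * T, N, D))
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, N, D)
        # Temporal attention across frames at each spatial position.
        t = self.norm2(x.permute(0, 2, 1, 3).reshape(B * N, T, D))
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x

class STANBranch(nn.Module):
    """Auxiliary branch consuming multi-level CLIP features (one tensor per
    selected CLIP layer) and returning a contextualized video embedding."""
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            SpatialTemporalBlock(dim) for _ in range(num_levels)
        )

    def forward(self, clip_features: list[torch.Tensor]) -> torch.Tensor:
        # clip_features[i]: (B, T, N, D) tokens from the i-th selected layer.
        x = torch.zeros_like(clip_features[0])
        for feat, block in zip(clip_features, self.blocks):
            x = block(x + feat)  # inject each CLIP level, then contextualize
        # Pool over frames and patches into one video embedding: (B, D).
        return x.mean(dim=(1, 2))

# Example: 4 frames of 7x7=49 patch tokens from two CLIP layers, width 768.
feats = [torch.randn(2, 4, 49, 768) for _ in range(2)]
video_emb = STANBranch(dim=768, num_levels=2)(feats)
print(video_emb.shape)  # torch.Size([2, 768])

Whatever the exact implementation, the decomposition keeps temporal modeling cheap: each frame attends spatially over N tokens and each spatial position attends temporally over T frames, instead of full joint attention over all T*N tokens.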


URL

https://arxiv.org/abs/2301.11116

PDF

https://arxiv.org/pdf/2301.11116.pdf

