
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

2023-01-26 14:12:02
Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, Thomas H. Li

Abstract

Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, and have thus attracted increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transfer, the key point in extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level, semantic-dominant tasks (e.g., retrieval) or low-level, visual-pattern-dominant tasks (e.g., recognition), and fail to handle both cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both the high-level and low-level knowledge in the CLIP model. To tackle this problem, we present the Spatial-Temporal Auxiliary Network (STAN), a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transfer, STAN adopts a branch structure with decomposed spatial-temporal modules that allow multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Code will be available at this https URL
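The abstract only sketches STAN's design: an auxiliary branch alongside CLIP whose decomposed spatial-temporal blocks consume features from multiple CLIP layers. The PyTorch sketch below illustrates one plausible reading of that structure; the module names, the use of self-attention along both axes, the residual fusion of per-level features, and the final pooling are all assumptions made for illustration, not the paper's implementation.

# A minimal, illustrative sketch of a STAN-style auxiliary branch, based only
# on the abstract's description. All names, dimensions, and the fusion
# strategy are assumptions, not the authors' code.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """One decomposed block: spatial self-attention over patches within each
    frame, then temporal self-attention over frames at each patch position."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- batch, frames, patch tokens, channels
        B, T, N, D = x.shape
        # Spatial attention within each frame.
        s = self.norm1(x.reshape(B * T, N, D))
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, N, D)
        # Temporal attention across frames at each spatial position.
        t = self.norm2(x.permute(0, 2, 1, 3).reshape(B * N, T, D))
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x

class STANBranch(nn.Module):
    """Auxiliary branch consuming multi-level CLIP features (one tensor per
    selected CLIP layer) and returning a contextualized video embedding."""
    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            SpatialTemporalBlock(dim) for _ in range(num_levels)
        )

    def forward(self, clip_features: list[torch.Tensor]) -> torch.Tensor:
        # clip_features[i]: (B, T, N, D) tokens from the i-th selected layer.
        x = torch.zeros_like(clip_features[0])
        for feat, block in zip(clip_features, self.blocks):
            x = block(x + feat)  # inject each CLIP level, then contextualize
        # Pool over frames and patches into one video embedding: (B, D).
        return x.mean(dim=(1, 2))

# Example: 4 frames of 7x7=49 patch tokens from two CLIP layers, width 768.
feats = [torch.randn(2, 4, 49, 768) for _ in range(2)]
video_emb = STANBranch(dim=768, num_levels=2)(feats)
print(video_emb.shape)  # torch.Size([2, 768])

Whatever the exact implementation, the decomposition keeps temporal modeling cheap: each frame attends spatially over N tokens and each spatial position attends temporally over T frames, instead of full joint attention over all T*N tokens.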


URL

https://arxiv.org/abs/2301.11116

PDF

https://arxiv.org/pdf/2301.11116.pdf

