Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting

2023-04-06 18:00:04
Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

Abstract

Adopting contrastive image-text pretrained models like CLIP for video classification has gained attention due to their cost-effectiveness and competitive performance. However, recent works in this area face a trade-off: finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization, while freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that balances supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side addresses three aspects: 1) global video-level prompts to model the data distribution; 2) local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51, and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize far fewer parameters and retain the existing general representation, which helps achieve the strong zero-shot performance. Our code and models are released at this https URL.
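
To make the prompting scheme concrete, the following is a minimal PyTorch sketch of the idea described above: learnable prompt tokens prepended to the frozen CLIP token sequence on the vision side, plus learnable context tokens on the text side. All names, sizes, and the initialization scale here (VitaCLIPPromptSketch, num_global, num_text, 0.02) are illustrative assumptions, not the authors' released implementation; see their linked code for the actual method.

import torch
import torch.nn as nn

class VitaCLIPPromptSketch(nn.Module):
    """Sketch of learnable prompts around a frozen CLIP backbone."""

    def __init__(self, embed_dim=768, num_frames=8, num_global=8, num_text=16):
        super().__init__()
        # 1) Global video-level prompts: shared across the whole clip,
        #    intended to model the video data distribution.
        self.global_prompts = nn.Parameter(0.02 * torch.randn(num_global, embed_dim))
        # 2) Local frame-level prompts: one token per frame, providing
        #    per-frame discriminative conditioning.
        self.frame_prompts = nn.Parameter(0.02 * torch.randn(num_frames, embed_dim))
        # 3) Summary prompt: a single token meant to pool a condensed
        #    representation of the whole video.
        self.summary_prompt = nn.Parameter(0.02 * torch.randn(1, embed_dim))
        # Text-side prompts that augment the textual context around
        # the class-name embedding.
        self.text_prompts = nn.Parameter(0.02 * torch.randn(num_text, embed_dim))

    def build_vision_sequence(self, frame_tokens):
        # frame_tokens: (batch, num_frames, embed_dim) frame embeddings from
        # the frozen CLIP vision encoder's embedding layer.
        b = frame_tokens.size(0)
        tile = lambda p: p.unsqueeze(0).expand(b, -1, -1)
        # [summary | global | frame-level | frame tokens]; this sequence would
        # be fed through the frozen transformer blocks (omitted here), with
        # only the prompt parameters receiving gradients.
        return torch.cat([tile(self.summary_prompt),
                          tile(self.global_prompts),
                          tile(self.frame_prompts),
                          frame_tokens], dim=1)

Because the backbone stays frozen, the trainable state reduces to these prompt tensors, which is what keeps the optimized parameter count low while preserving CLIP's general representation for zero-shot transfer.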

URL

https://arxiv.org/abs/2304.03307

PDF

https://arxiv.org/pdf/2304.03307.pdf

