Paper Reading AI Learner

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

2023-10-08 04:46:43
Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang

Abstract

Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained video descriptions. These detailed descriptions are further aligned with video features, facilitating a better transfer of CLIP to the video domain. Our approach is evaluated on three widely used action recognition datasets, following a variety of zero-shot evaluation protocols. The results demonstrate that our method surpasses existing state-of-the-art techniques by significant margins. Specifically, we achieve zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on UCF, HMDB, and Kinetics-600 datasets respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance, while utilizing substantially less fine-tuning data compared to other methods. Code is released at this https URL.
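
The key technical ingredient mentioned above, Interpolated Weight Optimization, amounts to mixing the original CLIP weights with the fine-tuned video-model weights during both training and testing. The snippet below is a minimal sketch of that interpolation step in PyTorch; the function name, the mixing coefficient `alpha`, and the state-dict arguments are illustrative assumptions, not the paper's exact implementation.

```python
import copy
import torch

def interpolate_weights(clip_state, video_state, alpha=0.5):
    """Blend original CLIP weights with fine-tuned video weights:
    theta = (1 - alpha) * theta_clip + alpha * theta_video."""
    merged = copy.deepcopy(video_state)
    for name, w_video in video_state.items():
        w_clip = clip_state.get(name)
        # Only interpolate parameters shared with CLIP; keep any video-specific
        # parameters (e.g. temporal modules) from the fine-tuned model as-is.
        if w_clip is not None and torch.is_floating_point(w_video):
            merged[name] = (1.0 - alpha) * w_clip + alpha * w_video
    return merged

# Hypothetical usage: evaluate the zero-shot classifier with interpolated weights.
# model.load_state_dict(interpolate_weights(clip_model.state_dict(),
#                                           finetuned_model.state_dict(),
#                                           alpha=0.5))
```

Sweeping `alpha` trades off specialization on the fine-tuning videos against the zero-shot generalization of the original CLIP weights, which is the balance the abstract's train/test-time interpolation aims to strike.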

Abstract (translated)

Despite the remarkable results achieved by CLIP in zero-shot image recognition, relatively little effort has been devoted to its potential for zero-shot video recognition. This paper introduces Open-VCLIP++, a simple yet effective framework that adapts CLIP into a strong zero-shot video classifier capable of recognizing novel actions and events at test time. Open-VCLIP++ makes minimal modifications to CLIP to capture spatial-temporal relationships in videos, creating a specialized video classifier while striving for generalization. We formally show that training Open-VCLIP++ is equivalent to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that exploits the benefits of weight interpolation during both training and testing. In addition, we build on large language models to generate fine-grained video descriptions, which are further aligned with video features to facilitate a better transfer of CLIP to the video domain. We evaluate our method on three widely used action recognition datasets under a variety of zero-shot evaluation protocols. The results show that our method significantly outperforms the existing state of the art. Specifically, we achieve zero-shot accuracies of 88.1%, 58.7%, and 81.2% on the UCF, HMDB, and Kinetics-600 datasets, surpassing the best alternative methods by 8.5%, 8.2%, and 12.3%, respectively. We also evaluate our method on the MSR-VTT video-text retrieval dataset, where it achieves competitive video-to-text and text-to-video retrieval performance while using substantially less fine-tuning data than other methods. Code is released at this https URL.

URL

https://arxiv.org/abs/2310.05010

PDF

https://arxiv.org/pdf/2310.05010.pdf

