Paper Reading AI Learner

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

2024-10-01 14:33:22
Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, Lianwen Jin

Abstract

Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather VILD, a large-scale pre-training dataset of VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding long-description capability. We also introduce two new tasks, Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), to further improve understanding. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate long-description capability more comprehensively. Extensive experimental results on widely used text-video retrieval benchmarks with both short and long descriptions, as well as on our LVDR benchmark, demonstrate the effectiveness of our method.
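The abstract only names the two ranking objectives (DDR and HDR) without giving their form. As a purely illustrative sketch, one common way to realize such a "description ranking" objective is a margin-based ranking loss over CLIP-style cosine similarities, applied to a list of descriptions ordered from most faithful/detailed to least (details progressively removed for DDR, or hallucinations progressively injected for HDR). The function name, margin value, and tensor shapes below are assumptions for illustration only; the paper's actual formulation may differ.

```python
# Hypothetical sketch of a detail-/hallucination-aware description ranking loss.
# Assumes CLIP-style L2-normalized embeddings and descriptions ordered from the
# most faithful/detailed (index 0) to the least; not the paper's exact method.
import torch
import torch.nn.functional as F

def description_ranking_loss(video_emb: torch.Tensor,
                             text_embs: torch.Tensor,
                             margin: float = 0.02) -> torch.Tensor:
    """video_emb: (d,) normalized video embedding.
    text_embs: (k, d) normalized text embeddings, ranked best to worst.
    Penalizes any adjacent pair where the better description does not score
    at least `margin` higher than the worse one."""
    sims = text_embs @ video_emb         # (k,) cosine similarities
    better, worse = sims[:-1], sims[1:]  # adjacent ranked pairs
    return F.relu(margin - (better - worse)).mean()

# Toy usage: one video embedding and three descriptions of decreasing quality.
video = F.normalize(torch.randn(512), dim=0)
texts = F.normalize(torch.randn(3, 512), dim=1)
print(description_ranking_loss(video, texts).item())
```

Under this kind of objective, a perfectly ordered similarity list (each better description scoring at least the margin above the next) yields zero loss, which matches the ranking behavior the LVDR benchmark is described as probing.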

URL

https://arxiv.org/abs/2410.00741

PDF

https://arxiv.org/pdf/2410.00741.pdf

