Abstract
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, its emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset with VIdeo and Long-Description pairs. We then propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding the long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), to further improve understanding. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate long-description capability more comprehensively. Extensive experimental results on widely used text-video retrieval benchmarks with both short and long descriptions, as well as on our LVDR benchmark, demonstrate the effectiveness of our method.
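For context on the pre-training setup the abstract describes, the sketch below shows a symmetric CLIP-style contrastive (InfoNCE) objective applied to video/long-description pairs. This is a minimal illustration under assumptions of ours, not the paper's exact recipe: random tensors stand in for encoder outputs, frame features are mean-pooled into one video embedding, and the temperature is fixed rather than learned.

```python
import torch
import torch.nn.functional as F

batch, frames, dim = 4, 8, 512

# Stand-ins for encoder outputs; a real pipeline would produce these with
# CLIP-style vision and text towers (shapes here are illustrative assumptions).
frame_feats = torch.randn(batch, frames, dim)   # per-frame visual features
text_feats = torch.randn(batch, dim)            # one long description per video

# Mean-pool frames into a single video embedding, then L2-normalize both sides.
video_emb = F.normalize(frame_feats.mean(dim=1), dim=-1)
text_emb = F.normalize(text_feats, dim=-1)

# Scaled cosine-similarity logits; CLIP learns this scale, here it is fixed.
logit_scale = 100.0
logits = logit_scale * video_emb @ text_emb.t()

# Matched pairs sit on the diagonal; average the video->text and
# text->video cross-entropy terms for the symmetric contrastive loss.
labels = torch.arange(batch)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())
```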
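DDR and HDR are ranking tasks: the model should score a detailed, faithful description above a detail-dropped or hallucinated variant of the same video. One plausible way to train such an ordering is a pairwise margin loss over a best-to-worst description list; the sketch below is our own minimal rendering of that idea (the function name, margin value, and adjacent-pair formulation are assumptions, not the paper's formulation).

```python
import torch
import torch.nn.functional as F

def description_ranking_loss(video_emb, desc_embs, margin=0.05):
    """Encourage similarity to decrease monotonically along a
    best-to-worst list of description embeddings (all L2-normalized)."""
    sims = desc_embs @ video_emb      # (k,) cosine similarities
    gaps = sims[:-1] - sims[1:]       # adjacent pairs: better minus worse
    # Penalize any adjacent pair whose gap is smaller than the margin.
    return F.relu(margin - gaps).mean()

video_emb = F.normalize(torch.randn(512), dim=-1)
# e.g. [full description, detail-dropped variant, hallucinated variant]
desc_embs = F.normalize(torch.randn(3, 512), dim=-1)
print(description_ranking_loss(video_emb, desc_embs).item())
```

Using adjacent pairs keeps the loss linear in the list length while still enforcing the full ordering transitively; a variant over all pairs would be an equally plausible reading of the abstract.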
URL
https://arxiv.org/abs/2410.00741