Abstract
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating "hard" negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, yielding an absolute improvement of over 9% on the SugarCrepe benchmark under an equal computational budget, as well as gains in zero-shot image classification and image retrieval. Our code, models, and data are available at: this https URL
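To make the idea concrete, below is a minimal sketch of how hard negative captions and synthesized negative images could be folded into a CLIP-style InfoNCE loss: each image contrasts against the in-batch captions plus its own hard negative caption, and each caption symmetrically contrasts against the in-batch images plus its synthesized negative image. The function name, argument layout, and temperature value are illustrative assumptions; this is not the paper's exact TripletCLIP objective or its alternating training schedule.

```python
# Illustrative sketch only (not the authors' exact TripletCLIP loss): a CLIP-style
# InfoNCE loss where each sample gets one extra "hard" negative appended to the
# in-batch negatives.
import torch
import torch.nn.functional as F


def clip_loss_with_hard_negatives(img_emb, txt_emb, neg_txt_emb, neg_img_emb,
                                  temperature=0.07):
    """All inputs are (batch, dim) embeddings with aligned rows.

    neg_txt_emb[i] is a hard negative caption for image i;
    neg_img_emb[i] is a synthesized negative image for caption i.
    """
    # L2-normalize so dot products are cosine similarities.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    neg_txt = F.normalize(neg_txt_emb, dim=-1)
    neg_img = F.normalize(neg_img_emb, dim=-1)

    n = img.size(0)
    labels = torch.arange(n, device=img.device)  # positives sit on the diagonal

    # Image -> text: in-batch captions plus each image's own hard negative caption.
    logits_i2t = img @ txt.t() / temperature                             # (n, n)
    hard_txt = (img * neg_txt).sum(dim=-1, keepdim=True) / temperature   # (n, 1)
    loss_i2t = F.cross_entropy(torch.cat([logits_i2t, hard_txt], dim=1), labels)

    # Text -> image: in-batch images plus each caption's synthesized negative image.
    logits_t2i = txt @ img.t() / temperature
    hard_img = (txt * neg_img).sum(dim=-1, keepdim=True) / temperature
    loss_t2i = F.cross_entropy(torch.cat([logits_t2i, hard_img], dim=1), labels)

    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch the extra logit column is the per-sample hard negative; the alternating use of negative captions and negative images described in the abstract is omitted and would be handled by the training loop.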
URL
https://arxiv.org/abs/2411.02545