Abstract
The success of contrastive language-image pretraining (CLIP) relies on supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE), which learns a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined by the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. We therefore consider the ontology in human language and propose using fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models of OpenAI CLIP and OpenCLIP on zero-shot image classification at lower (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
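The inference-time ensembling described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "correlation" is a temperature-scaled softmax over cosine similarities between task-metadata embeddings (for example, class-name embeddings) and fine-grained cluster centers, and that each fine-grained center is assigned to exactly one coarse-grained data expert. The function names, the `temperature` value, and the array shapes are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def expert_weights(task_emb, fine_centers, fine_to_expert, n_experts,
                   temperature=0.1):
    """Per-class weights over data experts (assumed routing rule).

    task_emb:       (C, D) embeddings of task metadata, e.g. class names
    fine_centers:   (M, D) fine-grained cluster centers
    fine_to_expert: (M,)   index of the coarse expert owning each center
    returns:        (C, K) weights; each row sums to 1
    """
    t = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    c = fine_centers / np.linalg.norm(fine_centers, axis=1, keepdims=True)
    sim = t @ c.T                                # (C, M) cosine similarity
    p = softmax(sim / temperature, axis=1)       # (C, M) over fine centers
    onehot = np.eye(n_experts)[fine_to_expert]   # (M, K) center -> expert
    return p @ onehot                            # sum mass per expert


def ensemble_logits(expert_logits, weights):
    """Combine expert outputs with per-class weights.

    expert_logits: (K, N, C) logits from K experts for N images, C classes
    weights:       (C, K)    output of expert_weights
    returns:       (N, C)    weighted sum over experts
    """
    return np.einsum("knc,ck->nc", expert_logits, weights)
```

Representing each expert by several fine-grained centers, rather than by a single coarse center, lets the weights reflect semantically tight clusters while the number of trained models stays small.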
URL
https://arxiv.org/abs/2404.16030