CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Abstract
Abstract (translated)
URL
PDF

Abstract

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable $2.7\times$ acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at \url{this https URL}.

Abstract (translated)

对比学习已成为通过图像和文本嵌入之间的对齐来学习有效视觉表示的一种变革性方法。然而，在图像和文本对之间的对比损失计算中，计算对偶相似性提出了计算挑战。本文提出了一种在面向互联网大小的图像-文本数据上的弱监督预训练视觉模型的新方法。将图像-文本数据的预训练重新定义为分类任务。因此，它消除了在对比学习在互联网大小的数据上进行对偶相似性计算的需求，实现了与对比学习在互联网大小的数据上训练的速度相比，训练速度提高了2.7倍。通过广泛的实验，包括检测和分割等不同视觉任务，我们证明了所提出的方法具有高表示质量。我们的源代码以及预训练模型权重和训练 recipe可在此处访问：\url{这个链接}。

URL

https://arxiv.org/abs/2404.15653

PDF

https://arxiv.org/pdf/2404.15653.pdf

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Abstract

Abstract (translated)

URL

PDF Copy

PDF