Paper Reading AI Learner

Long-CLIP: Unlocking the Long-Text Capability of CLIP

2024-03-22 17:58:16
Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

Abstract

Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-to-image generation by aligning the image and text modalities. Despite its widespread adoption, a significant limitation of CLIP is the inadequate length of its text input: the input is restricted to 77 text tokens, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications in image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses CLIP's zero-shot generalizability, and aligns with the CLIP latent space, so it can replace CLIP without any further adaptation in downstream frameworks. Achieving this goal is far from straightforward, however: simplistic fine-tuning can significantly degrade CLIP's performance, while substituting the text encoder with a language model that supports longer contexts would require pretraining on vast amounts of data, incurring significant expense. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain its original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by about 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
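The "knowledge-preserved stretching" strategy can be illustrated with a minimal sketch: keep the first few well-trained positional embeddings intact (the abstract notes the effective length is under 20) and interpolate only the remaining positions to a longer table. The cutoff of 20 and the target length of 248 are assumptions for illustration; the paper's exact interpolation details may differ.

```python
import numpy as np

def stretch_pos_embed(pos_embed, new_len, keep=20):
    """Sketch of knowledge-preserved stretching: the first `keep`
    positional embeddings are copied unchanged (they carry most of
    the trained knowledge), and the remaining ones are linearly
    interpolated to fill the new, longer position table.
    `keep=20` and the interpolation scheme are assumptions here."""
    old_len, dim = pos_embed.shape
    kept = pos_embed[:keep]          # preserved as-is
    tail = pos_embed[keep:]          # stretched to cover the new tail
    n_new_tail = new_len - keep
    # Map each new tail position onto a fractional old tail index,
    # then linearly blend the two neighboring old embeddings.
    old_idx = np.linspace(0, len(tail) - 1, num=n_new_tail)
    lo = np.floor(old_idx).astype(int)
    hi = np.minimum(lo + 1, len(tail) - 1)
    frac = (old_idx - lo)[:, None]
    stretched = (1 - frac) * tail[lo] + frac * tail[hi]
    return np.concatenate([kept, stretched], axis=0)

# Toy example: stretch a 77-position table (CLIP's limit) to 248
# positions; embedding width 8 stands in for the real model width.
pe = np.random.randn(77, 8).astype(np.float32)
longer = stretch_pos_embed(pe, new_len=248)
print(longer.shape)  # (248, 8)
```

Note how only the tail is resampled: a naive interpolation of the whole table would distort the first ~20 positions, which is exactly the well-trained region the strategy aims to preserve.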

URL

https://arxiv.org/abs/2403.15378

PDF

https://arxiv.org/pdf/2403.15378.pdf
