Paper Reading AI Learner

Long-CLIP: Unlocking the Long-Text Capability of CLIP

2024-03-22 17:58:16
Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

Abstract

Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-to-image generation by aligning the image and text modalities. Despite its widespread adoption, a significant limitation of CLIP is the inadequate length of its text input: the input is restricted to 77 text tokens, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications in image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses CLIP's zero-shot generalizability, and aligns with the CLIP latent space, so it can replace CLIP without any further adaptation in downstream frameworks. Achieving this goal is far from straightforward, however: simplistic fine-tuning can significantly degrade CLIP's performance, while substituting the text encoder with a language model that supports longer contexts would require pretraining on vast amounts of data, incurring significant expense. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain its original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by about 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
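The "knowledge-preserved stretching" strategy can be illustrated with a minimal sketch: keep the first few well-trained positional embeddings intact (the abstract notes the effective length is under 20) and interpolate only the remaining positions to a longer table. The cutoff of 20 and the target length of 248 are assumptions for illustration; the paper's exact interpolation details may differ.

```python
import numpy as np

def stretch_pos_embed(pos_embed, new_len, keep=20):
    """Sketch of knowledge-preserved stretching: the first `keep`
    positional embeddings are copied unchanged (they carry most of
    the trained knowledge), and the remaining ones are linearly
    interpolated to fill the new, longer position table.
    `keep=20` and the interpolation scheme are assumptions here."""
    old_len, dim = pos_embed.shape
    kept = pos_embed[:keep]          # preserved as-is
    tail = pos_embed[keep:]          # stretched to cover the new tail
    n_new_tail = new_len - keep
    # Map each new tail position onto a fractional old tail index,
    # then linearly blend the two neighboring old embeddings.
    old_idx = np.linspace(0, len(tail) - 1, num=n_new_tail)
    lo = np.floor(old_idx).astype(int)
    hi = np.minimum(lo + 1, len(tail) - 1)
    frac = (old_idx - lo)[:, None]
    stretched = (1 - frac) * tail[lo] + frac * tail[hi]
    return np.concatenate([kept, stretched], axis=0)

# Toy example: stretch a 77-position table (CLIP's limit) to 248
# positions; embedding width 8 stands in for the real model width.
pe = np.random.randn(77, 8).astype(np.float32)
longer = stretch_pos_embed(pe, new_len=248)
print(longer.shape)  # (248, 8)
```

Note how only the tail is resampled: a naive interpolation of the whole table would distort the first ~20 positions, which is exactly the well-trained region the strategy aims to preserve.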

URL

https://arxiv.org/abs/2403.15378

PDF

https://arxiv.org/pdf/2403.15378.pdf
