Paper Reading AI Learner

Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

2024-04-27 02:04:36
Chao Yi, Lu Ren, De-Chuan Zhan, Han-Jia Ye

Abstract

CLIP showcases exceptional cross-modal matching capabilities thanks to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction may be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of image feature representations, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives, thereby fully leveraging CLIP's robust cross-modal capabilities. The key to constructing a high-quality CODER lies in creating a large number of high-quality and diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically generate the required texts in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experimental results across various datasets and models confirm CODER's effectiveness. Code is available at: this https URL.
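To illustrate the idea, below is a minimal sketch of a neighbor-based representation built with the openai/clip package: an image is described by its cosine similarities to a pool of texts rather than by its raw embedding. The hand-written text list, the "ViT-B/32" backbone, and the "example.jpg" path are placeholders for illustration only; in the paper the text pool would come from ATG, and the exact CODER construction follows the paper rather than this sketch.

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone chosen only for illustration

# A toy pool of neighbor texts. The paper's Auto Text Generator (ATG) would
# produce a far larger and more diverse pool automatically.
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a blurry photo of an animal",
    "an animal sitting on a sofa",
]
text_tokens = clip.tokenize(texts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_feat = model.encode_image(image)       # shape (1, d)
    text_feats = model.encode_text(text_tokens)  # shape (T, d)

# Normalize so that dot products become cosine similarities, mirroring CLIP's
# contrastive pre-training objective.
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Neighbor-based representation: the image is described by its similarity
# structure to the text pool instead of by its raw embedding vector.
coder_like = image_feat @ text_feats.T           # shape (1, T)
print(coder_like.squeeze(0).tolist())

A downstream classifier (e.g., a zero-shot prompt match or a few-shot linear probe) would then operate on this similarity vector instead of the raw image embedding.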

URL

https://arxiv.org/abs/2404.17753

PDF

https://arxiv.org/pdf/2404.17753.pdf

