Paper Reading AI Learner

Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

2024-04-27 02:04:36
Chao Yi, Lu Ren, De-Chuan Zhan, Han-Jia Ye

Abstract

CLIP showcases exceptional cross-modal matching capabilities thanks to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction may be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of image feature representations, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives, thereby fully leveraging CLIP's robust cross-modal capabilities. The key to constructing a high-quality CODER lies in creating a large number of high-quality and diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically generate the required texts in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experimental results across various datasets and models confirm CODER's effectiveness. Code is available at: this https URL.
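To illustrate the idea, below is a minimal sketch of a neighbor-based representation built with the openai/clip package: an image is described by its cosine similarities to a pool of texts rather than by its raw embedding. The hand-written text list, the "ViT-B/32" backbone, and the "example.jpg" path are placeholders for illustration only; in the paper the text pool would come from ATG, and the exact CODER construction follows the paper rather than this sketch.

import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone chosen only for illustration

# A toy pool of neighbor texts. The paper's Auto Text Generator (ATG) would
# produce a far larger and more diverse pool automatically.
texts = [
    "a photo of a cat",
    "a photo of a dog",
    "a blurry photo of an animal",
    "an animal sitting on a sofa",
]
text_tokens = clip.tokenize(texts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_feat = model.encode_image(image)       # shape (1, d)
    text_feats = model.encode_text(text_tokens)  # shape (T, d)

# Normalize so that dot products become cosine similarities, mirroring CLIP's
# contrastive pre-training objective.
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Neighbor-based representation: the image is described by its similarity
# structure to the text pool instead of by its raw embedding vector.
coder_like = image_feat @ text_feats.T           # shape (1, T)
print(coder_like.squeeze(0).tolist())

A downstream classifier (e.g., a zero-shot prompt match or a few-shot linear probe) would then operate on this similarity vector instead of the raw image embedding.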

URL

https://arxiv.org/abs/2404.17753

PDF

https://arxiv.org/pdf/2404.17753.pdf

