Abstract
Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building on this, studies such as CoOp and CoCoOp propose prompt learning, where the context within a prompt is replaced with learnable vectors, yielding significant improvements over manually crafted prompts. However, the performance improvement for unseen classes remains marginal; to address this, data augmentation has frequently been used in traditional zero-shot learning techniques. Through our experiments, we identify an important issue in CoOp and CoCoOp: the context learned through conventional image augmentation is biased toward seen classes, which hurts generalization to unseen classes. To address this problem, we propose adversarial token embedding, which disentangles low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism, "Adding Attributes to Prompt Learning" (AAPL), we guide the learnable context to extract text features effectively by focusing on high-level features for unseen classes. We conduct experiments across 11 datasets; overall, AAPL compares favorably with existing methods on few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
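For readers unfamiliar with the prompt-learning setup that AAPL builds on, the sketch below illustrates the CoOp-style mechanism the abstract refers to: the hand-written context (e.g., "a photo of a") is replaced by learnable vectors that are prepended to each class-name embedding and optimized while the CLIP encoders stay frozen. This is a minimal illustration of the baseline idea, not the authors' AAPL code; all names and shapes are assumptions.

```python
# Minimal sketch of CoOp-style learnable prompt context (illustrative only,
# not the AAPL implementation). The context vectors are the only trainable
# parameters; the text encoder itself remains frozen.
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    def __init__(self, n_ctx: int = 16, ctx_dim: int = 512, n_classes: int = 10):
        super().__init__()
        # n_ctx learnable context vectors, shared across all classes
        self.ctx = nn.Parameter(torch.empty(n_ctx, ctx_dim))
        nn.init.normal_(self.ctx, std=0.02)
        self.n_classes = n_classes

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (n_classes, n_tokens, ctx_dim) token embeddings
        # of the class names; the learnable context is prepended to each.
        ctx = self.ctx.unsqueeze(0).expand(self.n_classes, -1, -1)
        return torch.cat([ctx, class_embeddings], dim=1)

# The concatenated sequences would then pass through the frozen text encoder,
# and only self.ctx receives gradients from the image-text contrastive loss.
```

AAPL's contribution, per the abstract, is to additionally apply an adversarial objective so that this learned context captures high-level class semantics rather than low-level augmentation artifacts; the details of that objective are in the paper at the URL below.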
URL
https://arxiv.org/abs/2404.16804