Paper Reading AI Learner

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

2023-03-21 14:12:08
Zaid Khan, Yun Fu

Abstract

Contrastive vision-language models (e.g., CLIP) are typically created by updating all the parameters of a vision model and a language model through contrastive training. Can such models instead be created by a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already-aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision and language model. We find that a minimal set of parameter updates (<7% of parameters) can achieve the same performance as full-model training, and that updating specific components (<1% of parameters) can match 75% of the performance of full-model training. We describe a series of experiments: we show that existing knowledge is conserved more strongly under parameter-efficient training, and that parameter-efficient training scales with model and dataset size. Where paired image-text data is scarce but strong multilingual language models exist (e.g., for low-resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at this https URL.
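The recipe the abstract describes is compact enough to sketch. What follows is a minimal, illustrative PyTorch sketch, not the authors' released code: two pretrained encoders are frozen, small newly initialized projection heads (plus, as one cheap choice of "specific components", the encoders' normalization parameters) stay trainable, and the two modalities are aligned with a CLIP-style symmetric contrastive loss. The encoder interfaces, the unfrozen parameter subset, and all hyperparameters here are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    """Aligns a frozen vision encoder and a frozen text encoder contrastively.

    Both encoders are assumed to return pooled features of shape (batch, dim);
    real encoders may need a pooling wrapper.
    """
    def __init__(self, vision_encoder, text_encoder, vis_dim, txt_dim, embed_dim=512):
        super().__init__()
        self.vision_encoder = vision_encoder  # pretrained, frozen below
        self.text_encoder = text_encoder      # pretrained, frozen below
        # Newly initialized projection heads: a tiny fraction of total parameters.
        self.vis_proj = nn.Linear(vis_dim, embed_dim, bias=False)
        self.txt_proj = nn.Linear(txt_dim, embed_dim, bias=False)
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, images, tokens):
        img = F.normalize(self.vis_proj(self.vision_encoder(images)), dim=-1)
        txt = F.normalize(self.txt_proj(self.text_encoder(tokens)), dim=-1)
        return img, txt

def clip_loss(img, txt, logit_scale):
    # Symmetric InfoNCE: the matched image-text pair in each row is the
    # positive; every other pairing in the batch is a negative.
    logits = logit_scale.exp() * img @ txt.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def mark_trainable(model):
    # Freeze everything, then re-enable only a small subset: the projection
    # heads, the temperature, and (one plausible choice, not necessarily the
    # paper's) the encoders' LayerNorm parameters.
    for p in model.parameters():
        p.requires_grad = False
    for head in (model.vis_proj, model.txt_proj):
        for p in head.parameters():
            p.requires_grad = True
    model.logit_scale.requires_grad = True
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

Hypothetical usage: after params = mark_trainable(model), an optimizer such as torch.optim.AdamW(params, lr=1e-4) sees only the unfrozen subset, and comparing sum(p.numel() for p in params) against the model's total parameter count gives the trainable fraction, which with two large frozen backbones typically lands well under the <7% budget the abstract cites.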

URL

https://arxiv.org/abs/2303.11866

PDF

https://arxiv.org/pdf/2303.11866.pdf

