Time-, Memory- and Parameter-Efficient Visual Adaptation

Abstract

As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. However, they typically still require backpropagating gradients throughout the model, meaning that their training time and memory costs do not decrease as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone. As a result, our method is efficient not only in terms of parameters, but also in training time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark, and we further show that it outperforms prior works with respect to training time and memory usage. We also demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification, without any intricate model parallelism. Here, we outperform both a prior adaptor-based method, which could only scale to a 1 billion parameter backbone, and full finetuning of a smaller backbone, with the same GPU and less training time.
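
The core idea in the abstract (train only a lightweight parallel network on features from a frozen backbone, so no gradients flow through the backbone) can be illustrated with a toy sketch. This is not the paper's architecture: the two-layer tanh "backbone", its random weights, the single linear side head, and all names (`BACKBONE`, `backbone_features`, `train_side`) are assumptions made purely for illustration, written in plain Python.

```python
import math
import random

rng = random.Random(0)

def make_layer(n_in, n_out):
    # Hypothetical stand-in for pretrained weights; never updated below.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

BACKBONE = [make_layer(2, 4), make_layer(4, 4)]  # frozen

def backbone_features(x):
    """Forward pass only: concatenate the activations of every frozen layer."""
    feats, h = [], x
    for layer in BACKBONE:
        h = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in layer]
        feats.extend(h)
    return feats

def side_forward(params, feats):
    """Assumed lightweight head: one linear unit followed by a sigmoid."""
    w, b = params
    z = sum(wi * fi for wi, fi in zip(w, feats)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_side(data, steps=100, lr=0.5):
    # Backbone features are computed once, without any gradient bookkeeping,
    # because the backward pass below never enters the backbone.
    cached = [(backbone_features(x), y) for x, y in data]
    w, b = [0.0] * len(cached[0][0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for feats, y in cached:
            err = side_forward((w, b), feats) - y  # d(cross-entropy)/d(logit)
            gw = [g + err * f for g, f in zip(gw, feats)]
            gb += err
        # Only the side-network parameters are updated.
        w = [wi - lr * g / len(cached) for wi, g in zip(w, gw)]
        b -= lr * gb / len(cached)
    return w, b

def loss(params, data):
    """Mean binary cross-entropy of the side head on top of frozen features."""
    total = 0.0
    for x, y in data:
        p = side_forward(params, backbone_features(x))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

# Toy binary task: the label equals the first input coordinate.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]
params = train_side(data)
```

Because the backward pass touches only the side head, the backbone activations need not be kept for gradient computation and can even be cached across epochs, which is the intuition behind the training-time and memory savings the abstract describes. In a real setting the side network would be larger and the backbone would be a pretrained vision transformer.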

URL

https://arxiv.org/abs/2402.02887

PDF

https://arxiv.org/pdf/2402.02887.pdf
