Time-, Memory- and Parameter-Efficient Visual Adaptation

2024-02-05 10:55:47
Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

Abstract

As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. They, however, typically still require backpropagating gradients throughout the model, meaning that their training-time and -memory cost does not reduce as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone. As a result, our method is efficient not only in terms of parameters, but also in training-time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark, and we further show how we outperform prior works with respect to training-time and -memory usage too. We further demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification, without any intricate model parallelism. Here, we outperform a prior adaptor-based method which could only scale to a 1 billion parameter backbone, or fully-finetuning a smaller backbone, with the same GPU and less training time.
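
The core idea — a lightweight trainable network running in parallel with a frozen backbone, so that backpropagation never touches the backbone's weights or activations — can be sketched in a few lines of PyTorch. The sketch below is illustrative only, not the paper's actual architecture: where the side network taps backbone features and how it is shaped are not specified on this page, and the names `SideAdapter`, `feat_dim`, and `hidden_dim` are hypothetical.

```python
import torch
import torch.nn as nn

class SideAdapter(nn.Module):
    """Illustrative sketch: a small trainable network in parallel with a
    frozen backbone. Gradients never flow through the backbone, so
    training cost is dominated by the adapter alone."""

    def __init__(self, backbone, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # freeze: parameter-efficient
        self.adapter = nn.Sequential(        # lightweight parallel network
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        with torch.no_grad():                # no backbone activations stored
            feats = self.backbone(x)         # -> memory- and time-efficient
        return self.adapter(feats)           # only this part backpropagates

# Usage with a stand-in "backbone" (a real setup would load a pretrained ViT):
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
model = SideAdapter(backbone, feat_dim=128, hidden_dim=16, num_classes=10)
optim = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad),
                          lr=1e-3)
loss = nn.functional.cross_entropy(model(torch.randn(4, 3, 32, 32)),
                                   torch.randint(0, 10, (4,)))
loss.backward()                              # gradients exist only in the adapter
optim.step()
```

Because the backbone runs under `torch.no_grad()`, its intermediate activations are never stored for the backward pass; this, together with the small trainable parameter count, is what makes such an approach efficient in training time and memory, not just in parameters.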

URL

https://arxiv.org/abs/2402.02887

PDF

https://arxiv.org/pdf/2402.02887.pdf

