Paper Reading AI Learner

HEAT: Head-level Parameter Efficient Adaptation of Vision Transformers with Taylor-expansion Importance Scores

2024-04-13 04:01:35
Yibo Zhong, Yao Zhou

Abstract

Prior computer vision research has extensively explored adapting pre-trained vision transformers (ViT) to downstream tasks. However, the substantial number of parameters requiring adaptation has led to a focus on Parameter Efficient Transfer Learning (PETL), which adapts large pre-trained models by training only a subset of their parameters, achieving both parameter and storage efficiency. Although methods with significantly reduced parameter counts have shown promising performance in transfer learning scenarios, the structural redundancy inherent in the model still leaves room for improvement and warrants further investigation. In this paper, we propose Head-level Efficient Adaptation with Taylor-expansion importance score (HEAT): a simple method that efficiently fine-tunes ViTs at the head level. Specifically, a first-order Taylor expansion is employed to calculate each head's importance score, termed the Taylor-expansion Importance Score (TIS), which indicates the head's contribution to a specific task. Additionally, three strategies for calculating TIS are employed to maximize its effectiveness; they compute TIS from different perspectives, reflecting the varying contributions of parameters. Besides ViT, HEAT has also been applied to hierarchical transformers such as the Swin Transformer, demonstrating its versatility across different transformer architectures. Through extensive experiments, HEAT demonstrates superior performance over state-of-the-art PETL methods on the VTAB-1K benchmark.
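The abstract does not spell out the scoring formula, but the standard first-order Taylor importance criterion (familiar from the network-pruning literature) approximates the change in loss caused by zeroing a parameter w as |dL/dw * w|; summing this quantity over all parameters belonging to one attention head yields a head-level score. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the `model`, `loader`, `attn.q_proj` attribute, and the head-grouped weight layout are all illustrative assumptions, and the paper's three TIS strategies are not reproduced here.

    # Minimal sketch of a first-order Taylor-expansion head-importance score.
    # Not the paper's released code: the model layout and the aggregation
    # choice below are illustrative assumptions.
    import torch

    def head_importance(model, loader, loss_fn, num_heads, head_dim):
        """Accumulate per-head |dL/dw * w| over one pass of the data.

        A first-order Taylor expansion of the loss around the current
        weights approximates the loss change from zeroing a parameter w
        as |dL/dw * w|; summing over all parameters of a head scores
        that head's contribution to the task.
        """
        scores = torch.zeros(num_heads)
        for inputs, targets in loader:
            model.zero_grad()
            loss_fn(model(inputs), targets).backward()
            # Score the query projection, whose output rows are assumed
            # to be grouped head-by-head (an assumption about this
            # hypothetical model's weight layout).
            w = model.attn.q_proj.weight   # (num_heads * head_dim, embed_dim)
            g = w.grad
            contrib = (g * w).abs()        # elementwise |dL/dw * w|
            scores += contrib.view(num_heads, head_dim, -1).sum(dim=(1, 2))
        return scores  # higher = head contributes more to this task

Heads ranked highly by such a score would presumably be the ones selected for fine-tuning while the rest remain frozen, which is how a head-level scheme would obtain its parameter efficiency.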

Abstract (translated)

Prior computer vision research has extensively explored adapting pre-trained vision transformers (ViT) to downstream tasks. However, the huge number of parameters requiring adaptation has drawn attention to Parameter Efficient Transfer Learning (PETL) methods, which adapt large pre-trained models efficiently by training only a subset of the model's parameters. Although the significantly reduced parameters have shown good performance in transfer learning scenarios, the structural redundancy of the model still leaves room for improvement, which calls for further study. In this paper, we propose HEAT: a simple method that efficiently fine-tunes ViTs at the head level. Specifically, a first-order Taylor expansion is used to compute each head's importance score, termed the Taylor-expansion Importance Score (TIS), which indicates its contribution to a specific task. Furthermore, to maximize the effectiveness of TIS, we adopt three strategies for computing it; these strategies compute TIS from different perspectives, reflecting the different contributions of parameters to specific tasks. Beyond ViT, HEAT has also been applied to hierarchical transformers such as the Swin Transformer, demonstrating its versatility across different transformer architectures. Through extensive experiments, HEAT demonstrates superior performance over state-of-the-art PETL methods on the VTAB-1K benchmark.

URL

https://arxiv.org/abs/2404.08894

PDF

https://arxiv.org/pdf/2404.08894.pdf

