Paper Reading AI Learner

Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

2025-07-17 16:09:05
Yiting Yang, Hao Luo, Yuan Sun, Qingsen Yan, Haokui Zhang, Wei Dong, Guoqing Wang, Peng Wang, Yang Yang, Hengtao Shen

Abstract

A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) is to freeze the majority of the backbone parameters and learn only low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly constructed as the product of a down-projection and an up-projection matrix, as exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality between any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent from the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduced upper bound on the model's generalization error, signifying enhanced generalization capability. If the fine-tuned down/up-projection matrices exhibited the same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further improved? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.
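The two ideas in the abstract can be made concrete with a small sketch. The first function measures how close a matrix's rows are to mutual orthogonality (mean absolute cosine similarity between distinct rows; near zero means approximately orthogonal). The second illustrates one standard way a single vector can generate a set of orthogonal vectors, via a Householder reflection. Note this is an illustrative mechanism under our own assumptions, not necessarily the construction used in the AOFT paper; the function names and shapes are hypothetical.

```python
import numpy as np

def orthogonality_gap(W):
    """Mean absolute cosine similarity between distinct rows of W.
    Values near 0 indicate approximately orthogonal rows."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    G = Wn @ Wn.T                                      # Gram matrix of cosines
    n = G.shape[0]
    off_diagonal = G - np.diag(np.diag(G))
    return np.abs(off_diagonal).sum() / (n * (n - 1))

def projection_from_vector(v, rank):
    """Build `rank` mutually orthonormal vectors from a single vector v
    using a Householder reflection H = I - 2 u u^T with u = v / ||v||.
    H is orthogonal, so any `rank` of its rows form an orthonormal set,
    giving a (rank x dim) down-projection-style matrix."""
    u = v / np.linalg.norm(v)
    H = np.eye(v.size) - 2.0 * np.outer(u, u)
    return H[:rank]

rng = np.random.default_rng(0)
dim, rank = 64, 8
v = rng.standard_normal(dim)            # stand-in for the single learnable vector
B = projection_from_vector(v, rank)     # rows are (exactly) orthonormal here
W_random = rng.standard_normal((16, dim))  # typical random init, rows NOT orthogonal
```

In this sketch `orthogonality_gap(B)` is essentially zero, while a randomly initialized matrix of the same width has a visibly larger gap, mirroring the paper's observation that backbone weights are approximately orthogonal but freshly learned down/up-projections are not.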

URL

https://arxiv.org/abs/2507.13260

PDF

https://arxiv.org/pdf/2507.13260.pdf
