Abstract
While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing it with static singular value decomposition (SVD) subsets, which leads to suboptimal use of pre-trained knowledge. Another path to improving LoRA is to incorporate a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt the SVD prior in a LoRA MoE architecture. To mitigate these issues, we propose Great LoRA Mixture-of-Expert (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with the full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithm, boosts the efficiency and performance of LoRA MoE. Experiments across 25 datasets, spanning natural language understanding, commonsense reasoning, image classification, and natural language generation, show that GOAT achieves state-of-the-art performance and closes the gap with Full FT.
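Code sketch: a minimal, hypothetical PyTorch illustration of the idea described above, namely a LoRA Mixture-of-Experts layer whose expert pairs are seeded from segments of the SVD of the frozen pre-trained weight, with top-k routing and a scaling factor applied to each expert's low-rank update. The expert count, rank, segment assignment, routing scheme, and scaling value are illustrative assumptions, not GOAT's exact recipe; GOAT derives its scaling factor theoretically, and how the frozen weight is adjusted so the layer reproduces the pre-trained output at initialization is specified in the paper, not in this sketch.

# Hypothetical sketch, not the authors' released implementation.
import torch
import torch.nn as nn


class SVDLoRAMoE(nn.Module):
    def __init__(self, weight: torch.Tensor, num_experts: int = 4,
                 rank: int = 8, top_k: int = 2, scaling: float = 1.0):
        super().__init__()
        out_dim, in_dim = weight.shape
        assert num_experts * rank <= min(out_dim, in_dim), "not enough singular directions"
        self.top_k, self.scaling = top_k, scaling
        W = weight.detach()
        self.weight = nn.Parameter(W.clone(), requires_grad=False)  # frozen base weight
        # The SVD of the pre-trained weight supplies the prior for every expert.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.router = nn.Linear(in_dim, num_experts, bias=False)
        self.A, self.B = nn.ParameterList(), nn.ParameterList()
        for e in range(num_experts):
            # Seed each expert with a different contiguous SVD segment
            # (a hypothetical assignment; the paper's choice may differ).
            idx = slice(e * rank, (e + 1) * rank)
            sqrt_s = S[idx].sqrt()
            self.A.append(nn.Parameter(sqrt_s[:, None] * Vh[idx, :]))  # (rank, in_dim)
            self.B.append(nn.Parameter(U[:, idx] * sqrt_s[None, :]))   # (out_dim, rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T                       # frozen pre-trained path
        probs = torch.softmax(self.router(x), dim=-1)  # per-token routing weights
        topv, topi = probs.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize over selected experts
        out = base
        for slot in range(self.top_k):
            for e in range(len(self.A)):
                mask = (topi[..., slot] == e).to(x.dtype).unsqueeze(-1)
                delta = (x @ self.A[e].T) @ self.B[e].T
                out = out + self.scaling * topv[..., slot:slot + 1] * mask * delta
        return out


if __name__ == "__main__":
    layer = SVDLoRAMoE(torch.randn(768, 768), num_experts=4, rank=8, top_k=2)
    y = layer(torch.randn(2, 16, 768))  # (batch, seq, out_dim)
    print(y.shape)

Seeding each expert from a distinct SVD segment gives it a different slice of the pre-trained spectrum to adapt, which is one way to read the abstract's claim that the SVD-structured MoE "adaptively integrates relevant priors"; the scaling factor on the low-rank updates is the knob the paper aligns with full fine-tuned MoE behavior.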
URL
https://arxiv.org/abs/2502.16894