Paper Reading AI Learner

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts

2024-04-22 02:15:52
Dengchun Li, Yingzi Ma, Naizheng Wang, Zhiyuan Cheng, Lei Duan, Jie Zuo, Cal Yang, Mingjie Tang

Abstract

Large Language Models (LLMs) have showcased exceptional performance across a wide array of Natural Language Processing (NLP) tasks. Fine-tuning techniques are commonly used to tailor pre-trained models to specific applications. While methods like LoRA effectively tackle GPU memory constraints during fine-tuning, their performance is often limited, especially in multi-task scenarios. In contrast, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance across multiple NLP tasks while maintaining a reduced parameter count. However, the resource requirements of these MoEs remain challenging, particularly for consumer-grade GPUs with limited VRAM. To address these challenges, we propose MixLoRA, an approach for constructing a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model and fine-tunes them with a commonly used top-k router. Unlike other LoRA-based MoE methods, MixLoRA enhances model performance by utilizing independently configurable attention-layer LoRA adapters, supporting LoRA and its variants for the construction of experts, and applying an auxiliary load-balance loss to address the imbalance problem of the router. In experiments, MixLoRA achieves commendable performance across all evaluation metrics in both single-task and multi-task learning scenarios. Implemented within the m-LoRA framework, MixLoRA enables parallel fine-tuning of multiple mixture-of-experts models on a single 24GB consumer-grade GPU without quantization, reducing GPU memory consumption by 41% and training latency by 17%.
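
As a rough illustration of the architecture described in the abstract, the sketch below shows how a MixLoRA-style block could be assembled: a frozen feed-forward network shared by all experts, per-expert LoRA adapters added on top of it, a top-k softmax router, and a Switch-Transformer-style auxiliary load-balance loss. This is a minimal sketch in PyTorch; the class names, the simplified non-gated FFN, and the hyperparameters (number of experts, top-k, LoRA rank) are assumptions for illustration, not the authors' m-LoRA implementation.

# Minimal sketch of a MixLoRA-style MoE block built from LoRA experts.
# Illustrative only; structure and names are assumptions, not the reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAAdapter(nn.Module):
    """Low-rank delta (alpha/r) * B(A x) applied on top of a frozen weight."""
    def __init__(self, in_dim, out_dim, r=8, alpha=16):
        super().__init__()
        self.lora_a = nn.Linear(in_dim, r, bias=False)   # down-projection A
        self.lora_b = nn.Linear(r, out_dim, bias=False)  # up-projection B
        self.scaling = alpha / r
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)               # adapter starts as a no-op

    def forward(self, x):
        return self.lora_b(self.lora_a(x)) * self.scaling

class MixLoRABlock(nn.Module):
    """Frozen dense FFN plus k-of-n LoRA experts selected by a top-k router."""
    def __init__(self, hidden_dim, ffn_dim, num_experts=8, top_k=2):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim)     # frozen pre-trained weight
        self.down = nn.Linear(ffn_dim, hidden_dim)   # frozen pre-trained weight
        for p in (*self.up.parameters(), *self.down.parameters()):
            p.requires_grad = False
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each expert is a pair of LoRA deltas over the shared frozen FFN.
        self.experts = nn.ModuleList(
            nn.ModuleDict({
                "up": LoRAAdapter(hidden_dim, ffn_dim),
                "down": LoRAAdapter(ffn_dim, hidden_dim),
            }) for _ in range(num_experts)
        )
        self.top_k = top_k
        self.num_experts = num_experts

    def forward(self, x):                          # x: (num_tokens, hidden_dim)
        probs = F.softmax(self.router(x), dim=-1)  # (num_tokens, num_experts)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            mask = (idx == e)                      # tokens routed to expert e
            token_mask = mask.any(dim=-1)
            if not token_mask.any():
                continue
            xe = x[token_mask]
            h = F.silu(self.up(xe) + self.experts[e]["up"](xe))
            y = self.down(h) + self.experts[e]["down"](h)
            gate = (weights * mask).sum(dim=-1)[token_mask].unsqueeze(-1)
            out[token_mask] += gate * y

        # Auxiliary load-balance loss (Switch-Transformer style): penalize the
        # product of each expert's token fraction and mean routing probability.
        frac_tokens = torch.zeros(self.num_experts, device=x.device)
        for e in range(self.num_experts):
            frac_tokens[e] = (idx == e).any(dim=-1).float().mean()
        aux_loss = self.num_experts * torch.sum(frac_tokens * probs.mean(dim=0))
        return out, aux_loss

Because the dense FFN weights are shared by all experts and stay frozen, only the small LoRA matrices and the router receive gradients, which keeps trainable parameters, gradients, and optimizer state far below those of a full MoE fine-tune.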

Abstract (translated)

Large Language Models (LLMs) have demonstrated outstanding performance across a wide range of Natural Language Processing (NLP) tasks. Fine-tuning techniques are commonly used to adapt pre-trained models to specific applications. Although methods such as LoRA effectively address GPU memory constraints during fine-tuning, their applicability is often limited, especially in multi-task settings. On the other hand, Mixture-of-Experts (MoE) models such as Mixtral 8x7B perform strongly on multiple NLP tasks while keeping the parameter count reduced. However, the resource requirements of these MoE models remain high, especially for consumer-grade GPUs with limited VRAM. To address these challenges, we propose MixLoRA, a method for building a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts into the feed-forward network block of a frozen pre-trained dense model and fine-tunes them with a commonly used top-k router. Unlike other LoRA-based MoE methods, MixLoRA improves model performance through independently configurable attention-layer LoRA adapters, supports LoRA and its variants for constructing experts, and applies an auxiliary load-balance loss to mitigate router imbalance. In experiments, MixLoRA achieves strong results on all evaluation metrics in both single-task and multi-task learning scenarios. With the m-LoRA framework, MixLoRA can fine-tune multiple mixture-of-experts models in parallel on a single 24GB consumer-grade GPU, reducing GPU memory consumption by 41% and training latency by 17%.

URL

https://arxiv.org/abs/2404.15159

PDF

https://arxiv.org/pdf/2404.15159.pdf

