Abstract
The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoE) models, known for dynamically allocating computational resources based on the input. Despite their promise, MoEs face challenges, particularly in their memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and the compute requirements of pre-trained MoE models. The first stage prunes the total number of experts using heavy-hitters counting as guidance, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy loss and reduce the number of experts activated during inference. Our empirical studies demonstrate the effectiveness of the method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
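The abstract only outlines the first stage at a high level. As a rough illustration of what heavy-hitters-guided expert pruning could look like in practice, the sketch below counts how often the router selects each expert over a calibration set and keeps only the most frequently activated ones. This is a minimal sketch under assumed details: the function names, the `keep_ratio` parameter, and the use of raw top-k selection counts as the pruning signal are hypothetical and not taken from the paper.

```python
import numpy as np

def heavy_hitter_expert_counts(router_topk_indices, num_experts):
    # Count how often each expert is selected by the router across a
    # calibration set; router_topk_indices has shape (num_tokens, k).
    return np.bincount(router_topk_indices.reshape(-1), minlength=num_experts)

def select_experts_to_keep(counts, keep_ratio=0.5):
    # Keep the most frequently activated ("heavy hitter") experts.
    num_keep = max(1, int(len(counts) * keep_ratio))
    keep_ids = np.argsort(-counts)[:num_keep]
    return np.sort(keep_ids)

# Hypothetical example: 8 experts, top-2 routing over 6 calibration tokens.
topk = np.array([[0, 3], [3, 5], [0, 3], [1, 3], [5, 0], [3, 5]])
counts = heavy_hitter_expert_counts(topk, num_experts=8)
print(counts)                                          # per-expert counts
print(select_experts_to_keep(counts, keep_ratio=0.5))  # ids of kept experts
```

In this illustration, the experts that are never (or rarely) routed to are dropped, shrinking the model's memory footprint; the second stage described in the abstract would then fine-tune with a regularization term to recover accuracy and reduce how many experts are activated per token.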
Abstract (translated)
The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoE) models, known for dynamically allocating computational resources based on the input. Despite their great potential, MoEs face challenges in terms of memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and the compute requirements of pre-trained MoE models. The first stage prunes the total number of experts guided by heavy-hitters counting, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy loss and reduce the number of experts activated during inference. Our empirical studies demonstrate the effectiveness of the method, yielding a sparse MoE model that preserves inference efficiency while minimizing the loss in accuracy.
URL
https://arxiv.org/abs/2404.05089