Abstract
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, meaning that the union of multiple consecutive tokens activates a large proportion of the parameters. Such a sparsity pattern is unfriendly to acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, together with efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), we design CLS-aware training objectives, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels that combine activation sparsity and speculative decoding for the first time. Experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to a 3.67$\times$ speedup over dense models on real end-side devices. All code and checkpoints are publicly available (this https URL).
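The two sparsity measures contrasted above can be made concrete with a small sketch. It is an illustration only, not the released BlockFFN code. It assumes the router applies ReLU to the routing logits and then RMSNorm over the expert dimension (the abstract names both components but not their order), that RMSNorm has no learnable gain, and that CLS is measured over non-overlapping chunks of consecutive tokens. The function names and the `w_router` parameter are hypothetical.

```python
import numpy as np


def relu_rmsnorm_router(x, w_router, eps=1e-6):
    """Hypothetical router: ReLU on the logits, then RMSNorm over experts.

    x:        (tokens, hidden) token representations
    w_router: (hidden, experts) routing projection (assumed name)
    """
    logits = x @ w_router                      # (tokens, experts)
    gated = np.maximum(logits, 0.0)            # ReLU: exact zeros mark inactive experts
    rms = np.sqrt(np.mean(gated ** 2, axis=-1, keepdims=True) + eps)
    return gated / rms                         # RMSNorm without a learnable gain (assumption)


def token_level_sparsity(weights):
    """Fraction of (token, expert) pairs that are inactive."""
    active = weights > 0
    return 1.0 - active.mean()


def chunk_level_sparsity(weights, chunk=8):
    """Fraction of experts left untouched by every token in a chunk.

    Chunks are non-overlapping runs of `chunk` consecutive tokens
    (an assumption); a trailing partial chunk is dropped.
    """
    active = weights > 0
    n_chunks = active.shape[0] // chunk
    trimmed = active[: n_chunks * chunk]
    # An expert counts as active for a chunk if any token in the chunk uses it.
    union = trimmed.reshape(n_chunks, chunk, -1).any(axis=1)
    return 1.0 - union.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 16))          # 64 tokens, hidden size 16
    w_router = rng.standard_normal((16, 32))   # 32 experts
    weights = relu_rmsnorm_router(x, w_router)
    print("TLS:", token_level_sparsity(weights))
    print("8-token CLS:", chunk_level_sparsity(weights, chunk=8))
```

Because a chunk's active set is the union of its tokens' active sets, CLS computed this way can never exceed TLS. With an untrained random router the gap is typically large, which is the pattern the CLS-aware objectives are meant to counteract.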
URL
https://arxiv.org/abs/2507.08771