Abstract
The rapid advancement of Large Language Models (LLMs) has markedly enhanced language understanding and generation capabilities. However, their substantial size poses hardware challenges, affecting both the memory required for serving and the inference latency of token generation. To address these challenges, we propose Dependency-aware Semi-structured Sparsity (DaSS), a novel pruning method for the recently prevalent SwiGLU-based LLMs. Our approach incorporates structural dependency into weight magnitude-based unstructured pruning. We introduce an MLP-specific pruning metric that evaluates the importance of each weight by jointly considering its magnitude and the norm of its corresponding MLP intermediate activation. DaSS strikes a balance between the adaptability offered by unstructured pruning and the structural consistency inherent in dependency-based structured pruning. Empirical evaluations on the Mistral and LLaMA2 model families demonstrate that DaSS not only outperforms both SparseGPT and Wanda in achieving hardware-friendly N:M sparsity patterns but also maintains the computational efficiency of Wanda.
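To make the idea of an activation-aware N:M pattern concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact formula, so the score used here (absolute weight times the norm of the intermediate channel tied to that weight) is an assumption, and the helper names `importance_scores` and `nm_prune_row` as well as the toy numbers are hypothetical.

```python
def importance_scores(weight_row, channel_norms):
    """Assumed metric: |weight| scaled by the norm of its tied intermediate channel."""
    return [abs(w) * a for w, a in zip(weight_row, channel_norms)]


def nm_prune_row(weights, scores, n=2, m=4):
    """Keep the n highest-scoring weights in each consecutive group of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = list(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        # Indices of the n most important weights within this group.
        keep = sorted(group, key=lambda i: scores[i], reverse=True)[:n]
        for i in group:
            if i not in keep:
                pruned[i] = 0.0
    return pruned


# Toy row of 8 weights and hypothetical per-channel activation norms.
weights = [0.9, -0.1, 0.4, 0.3, 0.2, 0.8, -0.7, 0.05]
norms = [0.1, 5.0, 1.0, 1.0, 2.0, 0.25, 1.0, 4.0]

# 2:4 pruning with the activation-aware score versus magnitude alone.
activation_aware = nm_prune_row(weights, importance_scores(weights, norms))
magnitude_only = nm_prune_row(weights, [abs(w) for w in weights])
```

In this toy row the two criteria disagree: magnitude alone keeps the weight 0.9, while the activation-aware score drops it because its channel norm is small, and keeps -0.1 instead because its channel is highly active. Both results have exactly two nonzero weights in every group of four, which is the 2:4 instance of the N:M pattern.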
URL
https://arxiv.org/abs/2405.01943