Abstract
Activation sparsity denotes the existence of substantial weakly-contributing elements within activation outputs that can be eliminated, benefiting many important applications concerning large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep study, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study of the quantitative scaling properties and influential factors of activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. First, different activation functions exhibit comparable performance but opposite training-time sparsity trends: the activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power law for SiLU-activated LLMs and as a decreasing logspace power law for ReLU-activated LLMs as the amount of training data grows. This demonstrates that ReLU is a more efficient activation function than SiLU and can leverage more training data to improve activation sparsity. Second, the activation ratio increases linearly with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale; that is, the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws toward LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
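The abstract names the PPL-$p\%$ metric without spelling out a computation, so below is a minimal, hypothetical sketch of a performance-aware activation-ratio measurement in its spirit: zero out FFN intermediate activations whose magnitude falls below a threshold, and pick the largest threshold whose perplexity degradation stays within $p\%$ of the dense model. The helper names (`activation_ratio`, `eval_ppl`, the candidate threshold grid) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def activation_ratio(acts: torch.Tensor, threshold: float) -> float:
    """Fraction of FFN intermediate activations with magnitude above
    `threshold`; equals 1 - sparsity ratio."""
    return (acts.abs() > threshold).float().mean().item()

def ppl_p_threshold(eval_ppl, base_ppl: float, p: float, candidates):
    """Illustrative search (not the paper's exact procedure): return the
    largest truncation threshold whose perplexity stays within p% of the
    dense model's perplexity.

    `eval_ppl(t)` is an assumed callback that evaluates the model with
    activations of magnitude <= t zeroed out and returns its perplexity.
    """
    best = 0.0
    for t in sorted(candidates):  # sweep thresholds from small to large
        if eval_ppl(t) <= (1.0 + p / 100.0) * base_ppl:
            best = t  # still within the p% perplexity budget
        else:
            break  # assume larger thresholds only degrade perplexity further
    return best
```

The reported activation ratio would then be `activation_ratio(acts, ppl_p_threshold(...))`, averaged over layers and tokens. For the training-data trends, one parameterization consistent with the abstract's wording (again an assumption, not the paper's fitted law) is $A_{\mathrm{SiLU}}(D) \approx A_\infty - c\,D^{-\alpha}$ (convergent and increasing in the amount of training data $D$) and $\log A_{\mathrm{ReLU}}(D) \approx \log A_\infty + c\,D^{-\alpha}$ (a power law in log space, decreasing toward its limit), with $c,\alpha>0$.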
URL
https://arxiv.org/abs/2411.02335