Paper Reading AI Learner

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

2024-11-04 17:59:04
Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun

Abstract

Activation sparsity denotes the existence of a substantial number of weakly contributing elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep study, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and a decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These results demonstrate that ReLU is more efficient as an activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio increases linearly with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
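
The abstract states the scaling trends only qualitatively. As a rough reading aid, one set of illustrative functional forms consistent with "convergent increasing power-law" (SiLU) and "decreasing logspace power-law" (ReLU) is sketched below; the symbols $A_{\infty}$, $c_1$, $c_2$, $\alpha$, $\beta$, $k$, and $b$ are placeholders, not the paper's fitted constants:

$$A_{\mathrm{SiLU}}(D) \approx A_{\infty}^{\mathrm{SiLU}} - c_1\, D^{-\alpha}, \qquad A_{\mathrm{ReLU}}(D) \approx A_{\infty}^{\mathrm{ReLU}} + c_2\, (\log D)^{-\beta}, \qquad c_1, c_2, \alpha, \beta > 0,$$

so the SiLU activation ratio rises toward its limit as the training-data amount $D$ grows, while the ReLU activation ratio falls toward a lower limit (i.e., greater sparsity). The reported architectural trend can likewise be read as $A \approx k\,(w/d) + b$ for width-depth ratios $w/d$ below the bottleneck point.

The PPL-$p\%$ metric ties the measurement to a tolerated perplexity increase of $p\%$, as the "performance-aware" qualifier suggests. The minimal sketch below shows only the inner measurement step, i.e., the activation ratio of one FFN layer at a fixed magnitude threshold; the threshold search against perplexity is omitted, and all names, the threshold value, and the toy tensor are hypothetical:

    import torch

    def activation_ratio(ffn_acts: torch.Tensor, threshold: float = 1e-3) -> float:
        """Fraction of FFN intermediate activations whose magnitude exceeds `threshold`.

        `ffn_acts` holds the post-activation values of one FFN layer,
        shape (num_tokens, intermediate_size); activation ratio = 1 - sparsity ratio.
        """
        return (ffn_acts.abs() > threshold).float().mean().item()

    # Toy example: a ReLU-style pattern zeroes out roughly half of the entries.
    acts = torch.relu(torch.randn(8, 4096))
    print(f"activation ratio ~ {activation_ratio(acts):.3f}")  # about 0.5 for this toy tensor

A full PPL-$p\%$ measurement would additionally search for the largest threshold that keeps perplexity within $p\%$ of the unpruned model; that search is not shown here.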

URL

https://arxiv.org/abs/2411.02335

PDF

https://arxiv.org/pdf/2411.02335.pdf

