Abstract
Recent works have revealed redundancy across transformer blocks, motivating research on depth compression to prune less important blocks. However, existing whole-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. As another line of model compression, channel pruning better preserves performance, but it cannot reduce model depth and is challenged by inconsistent pruning ratios across individual layers. To pursue better model compression and acceleration, this paper proposes \textbf{FlattenGPT}, a novel way to detect and reduce depth-wise redundancy. By flattening two adjacent blocks into one, it compresses the network depth while enabling more effective detection and removal of parameter redundancy. FlattenGPT preserves the knowledge learned in all blocks and remains consistent with the original transformer architecture. Extensive experiments demonstrate that FlattenGPT improves model efficiency with a favorable performance trade-off. It outperforms existing pruning methods in both zero-shot accuracy and WikiText-2 perplexity across various model types and parameter sizes. On LLaMA-2/3 and Qwen-1.5 models, FlattenGPT retains 90-96\% of zero-shot performance at a 20\% compression ratio. It also outperforms other pruning methods in accelerating LLM inference, making it a promising approach for improving the efficiency of transformers.
URL
https://arxiv.org/abs/2602.08858