Paper Reading AI Learner

FlattenGPT: Depth Compression for Transformer with Layer Flattening

2026-02-09 16:22:58
Ruihan Xu, Qingpei Guo, Yao Zhu, Xiangyang Ji, Ming Yang, Shiliang Zhang

Abstract

Recent works have revealed redundancy across transformer blocks, prompting research into depth compression, which prunes less crucial blocks. However, existing entire-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. Channel pruning, another line of model compression, better preserves performance, but it cannot reduce model depth and is challenged by inconsistent pruning ratios across individual layers. To pursue better model compression and acceleration, this paper proposes \textbf{FlattenGPT}, a novel way to detect and reduce depth-wise redundancy. By flattening two adjacent blocks into one, it compresses the network depth while enabling more effective detection and removal of parameter redundancy. FlattenGPT preserves the knowledge learned in all blocks and remains consistent with the original transformer architecture. Extensive experiments demonstrate that FlattenGPT improves model efficiency with a favorable performance trade-off. It outperforms existing pruning methods in both zero-shot accuracy and WikiText-2 perplexity across various model types and parameter sizes. On LLaMA-2/3 and Qwen-1.5 models, FlattenGPT retains 90-96\% of zero-shot performance at a 20\% compression ratio. It also outperforms other pruning methods in accelerating LLM inference, making it promising for enhancing transformer efficiency.
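The abstract does not detail how two adjacent blocks are flattened, so the following is only a minimal toy sketch of the general idea, not the paper's actual method. It uses linear residual blocks (no nonlinearity), where merging two adjacent blocks into one is exact: x + W1·x followed by a second residual update equals a single block with the combined weight W1 + W2 + W2·W1. Real transformer blocks contain nonlinearities, so any such flattening there would be approximate; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)

# Two adjacent residual blocks (linear toy stand-ins for transformer blocks).
W1 = 0.1 * rng.normal(size=(d, d))
W2 = 0.1 * rng.normal(size=(d, d))

def block(W, h):
    # One residual update: h -> h + W h
    return h + W @ h

deep = block(W2, block(W1, x))  # depth-2 computation

# "Flattened" single block: expanding the composition gives
# x + W1 x + W2 x + W2 W1 x, i.e. one block with weight W1 + W2 + W2 W1.
W_flat = W1 + W2 + W2 @ W1
flat = x + W_flat @ x

assert np.allclose(deep, flat)
```

In this linear toy case the depth is halved with no loss; the interesting part of depth compression is handling the nonlinear case, where the merged block can only approximate the original pair.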

URL

https://arxiv.org/abs/2602.08858

PDF

https://arxiv.org/pdf/2602.08858.pdf

