Paper Reading AI Learner

Model-Preserving Adaptive Rounding

2025-05-29 01:53:00
Albert Tseng, Zhaofeng Sun, Christopher De Sa

Abstract

The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by $\approx 30\%$ while achieving state-of-the-art performance on downstream tasks.
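To make the high-level idea concrete, the following minimal sketch (not the paper's actual YAQA algorithm) illustrates how a Kronecker-factored layerwise Hessian $H \approx A \otimes B$ could be used to score candidate roundings of a linear layer's weight matrix via the quadratic proxy $\mathrm{tr}(B\,\Delta W\,A\,\Delta W^\top)$, and how rounding weights adaptively against that proxy can beat plain nearest rounding. The random factors, the toy uniform grid quantizer, and the greedy per-weight update are assumptions made purely for illustration.

```python
# Illustrative sketch only: Kronecker-factored Hessian proxy for rounding error.
# A (input side) and B (output side) stand in for sketched Hessian factors;
# the uniform grid quantizer and greedy update are NOT the paper's method.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 16
W = rng.normal(size=(d_out, d_in))

# Random PSD matrices standing in for the Kronecker factors of the layer Hessian.
A = np.cov(rng.normal(size=(d_in, 4 * d_in)))    # input-side factor  (d_in x d_in)
B = np.cov(rng.normal(size=(d_out, 4 * d_out)))  # output-side factor (d_out x d_out)

def proxy_loss(W_hat):
    """Quadratic proxy for full-model loss increase under H ~= A (x) B."""
    dW = W_hat - W
    return np.trace(B @ dW @ A @ dW.T)

# Toy uniform scalar quantizer (stand-in for whatever quantizer is being wrapped).
step = 0.25
def quantize_nearest(x):
    return np.round(x / step) * step

# Baseline: round every weight to its nearest grid point.
W_rtn = quantize_nearest(W)

# "Adaptive" rounding: for each weight, pick the up/down grid point that lowers
# the Hessian-weighted proxy, rather than always taking the nearest one.
W_adapt = W_rtn.copy()
for i in range(d_out):
    for j in range(d_in):
        best, best_loss = W_adapt[i, j], proxy_loss(W_adapt)
        for cand in (np.floor(W[i, j] / step) * step, np.ceil(W[i, j] / step) * step):
            W_adapt[i, j] = cand
            loss = proxy_loss(W_adapt)
            if loss < best_loss:
                best, best_loss = cand, loss
        W_adapt[i, j] = best

print("nearest rounding proxy loss :", proxy_loss(W_rtn))
print("adaptive rounding proxy loss:", proxy_loss(W_adapt))
```

The key point the sketch conveys is that the rounding decision for one weight is coupled to the others through the Hessian factors, which is why a Hessian-aware rounding can track the full model better than independently minimizing per-layer activation error.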

Abstract (translated)

The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close as possible to that of the original model. To achieve this, almost all large language model (LLM) PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily yield a model closer to the original. In this paper, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be computed tractably for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, experiments show that YAQA reduces the KL divergence between the compressed model and the original model by about 30% while achieving state-of-the-art performance on downstream tasks.

URL

https://arxiv.org/abs/2505.22988

PDF

https://arxiv.org/pdf/2505.22988.pdf

