Abstract
The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To make this tractable, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily yield a model that is closer to the original. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the \textit{full model} KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by $\approx 30\%$ while achieving state-of-the-art performance on downstream tasks.
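To make the distinction in the abstract concrete, the sketch below (hypothetical PyTorch, not the authors' code) contrasts the usual layerwise proxy, which only measures the immediate activation error, with a Kronecker-factored proxy for the full-model loss of the kind YAQA uses. The factors A and B here are generic K-FAC-style estimates (input-activation and output-gradient covariances); the paper's actual Hessian sketches and its rounding algorithm are not reproduced.

    import torch

    def layerwise_proxy(W, W_hat, X):
        """Immediate activation error ||(W - W_hat) X||_F^2 = tr(D H D^T) with H = X X^T."""
        D = W - W_hat
        return (D @ X).pow(2).sum()

    def kron_factored_proxy(W, W_hat, A, B):
        """Second-order full-model proxy 1/2 vec(D)^T (A kron B) vec(D) = 1/2 tr(A D^T B D),
        where A (in x in) and B (out x out) are Kronecker factors of the layer's Hessian
        taken with respect to the full-model objective (symmetric A assumed)."""
        D = W - W_hat
        return 0.5 * torch.trace(A @ D.T @ B @ D)

    # Toy usage with assumed shapes: W is (out, in), X is (in, n_tokens).
    out_dim, in_dim, n = 8, 16, 64
    W = torch.randn(out_dim, in_dim)
    W_hat = torch.round(W * 4) / 4      # stand-in for any quantizer
    X = torch.randn(in_dim, n)
    A = (X @ X.T) / n                   # input-side factor (K-FAC style)
    G = torch.randn(out_dim, n)         # stand-in for output-side gradients
    B = (G @ G.T) / n                   # output-side factor
    print(layerwise_proxy(W, W_hat, X), kron_factored_proxy(W, W_hat, A, B))

Because B re-weights rows of the error by how strongly the rest of the network reacts to them, two quantizations with identical layerwise proxy values can differ under the Kronecker-factored proxy, which is the gap the abstract's localized-objective critique refers to.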
Abstract (translated)
The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To achieve this, almost all large language model (LLM) PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily produce a model closer to the original. In this paper, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, experiments show that YAQA reduces the KL divergence between the compressed model and the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.
URL
https://arxiv.org/abs/2505.22988