Abstract
In this paper, we propose StableQuant, a novel adaptive post-training quantization (PTQ) algorithm for widely used speech foundation models (SFMs). While PTQ has been successfully employed for compressing large language models (LLMs) due to its ability to bypass additional fine-tuning, directly applying these techniques to SFMs may not yield optimal results, as SFMs utilize distinct network architectures for feature extraction. StableQuant achieves strong quantization performance regardless of the network architecture type, as it adaptively determines the quantization range for each layer by analyzing both the scale distributions and overall performance. We evaluate our algorithm on two SFMs, HuBERT and wav2vec2.0, on an automatic speech recognition (ASR) task, and achieve superior performance compared to traditional PTQ methods. With 8-bit quantization, StableQuant reduces SFM model sizes to a quarter and doubles inference speed while limiting the word error rate (WER) degradation to less than 0.3%.
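To make the idea of per-layer adaptive quantization ranges concrete, the sketch below shows a generic approach: for each layer, a clipping ratio is searched over so that the symmetric 8-bit quantization grid minimizes reconstruction error. This is an illustrative stand-in, not the StableQuant algorithm itself; the `quantize` and `adaptive_range` helpers and the candidate-ratio grid are assumptions for demonstration.

```python
import numpy as np

def quantize(x, scale, bits=8):
    # Symmetric uniform quantization to signed integers, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def adaptive_range(weights, bits=8, ratios=np.linspace(0.5, 1.0, 11)):
    # Per-layer range search: clip at a fraction of max|w| and keep the
    # scale that minimizes reconstruction MSE for this layer.
    max_abs = np.abs(weights).max()
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, np.inf
    for r in ratios:
        scale = (r * max_abs) / qmax
        err = np.mean((weights - quantize(weights, scale, bits)) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)        # stand-in for one layer's weights
scale = adaptive_range(w)        # layer-specific quantization range
w_q = quantize(w, scale)         # dequantized 8-bit approximation
```

Because the search is run independently per layer, layers with heavy-tailed scale distributions can choose tighter clipping than layers with compact distributions, which is the intuition behind adapting the range to each layer.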
URL
https://arxiv.org/abs/2504.14915