Paper Reading AI Learner

StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models

2025-04-21 07:33:27
Yeona Hong, Hyewon Han, Woo-jin Chung, Hong-Goo Kang

Abstract

In this paper, we propose StableQuant, a novel adaptive post-training quantization (PTQ) algorithm for widely used speech foundation models (SFMs). While PTQ has been successfully employed for compressing large language models (LLMs) due to its ability to bypass additional fine-tuning, directly applying these techniques to SFMs may not yield optimal results, as SFMs utilize distinct network architectures for feature extraction. StableQuant demonstrates optimal quantization performance regardless of the network architecture type, as it adaptively determines the quantization range for each layer by analyzing both the scale distributions and overall performance. We evaluate our algorithm on two SFMs, HuBERT and wav2vec2.0, for an automatic speech recognition (ASR) task, and achieve superior performance compared to traditional PTQ methods. StableQuant reduces SFM model sizes to a quarter and doubles the inference speed while limiting the word error rate (WER) degradation to less than 0.3% with 8-bit quantization.
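The abstract describes per-layer adaptive selection of the quantization range; the exact criterion combining scale distributions with overall performance is detailed in the paper itself. The snippet below is only a minimal sketch of the general idea of layer-wise clipping-range search for 8-bit uniform quantization, where candidate ranges (here, hypothetical percentiles of the weight magnitudes) are scored by quantization error. All function names, parameters, and the error metric are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize(x, scale, n_bits=8):
    """Uniform symmetric quantization of x with a given clipping scale."""
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(x / scale * qmax), -qmax, qmax)
    return q / qmax * scale  # dequantized values for error measurement

def search_layer_clip(weights, n_bits=8,
                      percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    """Sketch: pick a per-layer clipping range by testing candidate
    percentiles of the weight-magnitude distribution and keeping the one
    with the lowest mean-squared quantization error (assumed criterion)."""
    best_scale, best_err = None, np.inf
    flat = np.abs(weights).ravel()
    for p in percentiles:
        scale = np.percentile(flat, p)
        if scale == 0:
            continue
        err = np.mean((weights - quantize(weights, scale, n_bits)) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

# Example: a layer with a heavy-tailed weight distribution benefits from
# clipping a small fraction of outliers instead of covering the full range.
rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=(1024, 1024)).astype(np.float32)
print(f"chosen clipping range: +/- {search_layer_clip(w, n_bits=8):.3f}")
```

In practice, a layer-adaptive search of this kind lets layers with heavy-tailed activations or weights (common in transformer-based SFMs) trade off outlier clipping against resolution for the bulk of the distribution, which is the failure mode that fixed min-max PTQ handles poorly.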

URL

https://arxiv.org/abs/2504.14915

PDF

https://arxiv.org/pdf/2504.14915.pdf

