An Effective Training Framework for Light-Weight Automatic Speech Recognition Models

2025-05-22 17:55:09
Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman

Abstract

Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation, or require prolonged training of the smaller models to reach better performance. To address these issues, we introduce an effective two-step, representation-learning-based approach that can produce several small models from a single large model while ensuring considerably better performance within a limited number of epochs. Comprehensive experiments on ASR benchmarks demonstrate the efficacy of our approach, which achieves a three-fold training speed-up and up to 12.54% word error rate improvement.

URL

https://arxiv.org/abs/2505.16991

PDF

https://arxiv.org/pdf/2505.16991.pdf

