Paper Reading AI Learner

Efficient Compression of Multitask Multilingual Speech Models

2024-05-02 03:11:59
Thomas Palmeira Ferraz

Abstract

Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results on a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) biases. Despite that, we show that only model-related biases are amplified by quantization, with a greater impact on low-resource languages and smaller models. Searching for a better compression approach, we propose DistilWhisper, an approach able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while preserving the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages on both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
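The abstract first studies compression by quantization and finds that it amplifies model-related bias on low-resource languages. The abstract does not specify the quantization recipe, so the sketch below is an assumption: one common post-training setup, int8 dynamic quantization of the linear layers of whisper-small in PyTorch.

```python
# Hedged sketch: post-training dynamic quantization of Whisper.
# Assumption: int8 quantization of nn.Linear layers only; the paper's
# exact quantization scheme may differ.
import torch
from transformers import WhisperForConditionalGeneration

# Load the small multilingual checkpoint studied in the paper.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Swap every nn.Linear for a dynamically quantized int8 version (CPU inference).
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Under this kind of compression, the abstract's finding is that average ASR quality can look acceptable while errors grow disproportionately on under-represented languages and smaller models.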
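DistilWhisper itself combines two ingredients: small language-specific experts added to a frozen whisper-small backbone, and knowledge distillation from a frozen whisper-large-v2 teacher. Below is a minimal PyTorch sketch of both; the module name `LanguageSpecificFFN`, the sigmoid gating form, and the loss weights (`alpha`, temperature `T`) are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the two ingredients described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageSpecificFFN(nn.Module):
    """Lightweight per-language expert: a small feed-forward block whose
    output is mixed with the frozen backbone's output via a learned gate."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.gate = nn.Linear(d_model, 1)  # how much to trust the expert

    def forward(self, hidden: torch.Tensor, backbone_out: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(hidden))            # (batch, time, 1)
        return g * self.ffn(hidden) + (1.0 - g) * backbone_out

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Cross-entropy on gold transcripts plus KL to the frozen
    whisper-large-v2 teacher (standard Hinton-style distillation)."""
    # logits: (batch, time, vocab); labels: (batch, time), -100 = padding
    ce = F.cross_entropy(student_logits.transpose(1, 2), labels, ignore_index=-100)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kl
```

Training only the experts and gates while the multilingual backbone stays frozen is what would keep the parameter overhead at inference negligible, consistent with the abstract's claim.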

URL

https://arxiv.org/abs/2405.00966

PDF

https://arxiv.org/pdf/2405.00966.pdf

