Paper Reading AI Learner

Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model

2024-09-03 16:53:38
Hukai Huang, Jiayan Lin, Kaidi Wang, Yishuang Li, Wenhao Guan, Qingyang Hong, Lin Li

Abstract

Due to the inherent difficulty of modeling phonetic similarities across different languages, code-switching speech recognition presents a formidable challenge. This study proposes Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups. First, a preceding routing network explicitly learns the Language Identification (LID) task and selects experts based on the acquired LID weights. This process provides robust routing information to the MoE layer, mitigating interference from diverse language domains with expert network parameter updates. The LID weights are also employed to facilitate inter-group collaboration, enabling the integration of language-specific representations. Furthermore, within each language expert group, a gating network operates in an unsupervised manner to foster collaboration on attributes beyond language. Extensive experiments demonstrate the efficacy of our approach, achieving significant performance improvements over alternative methods. Importantly, our method preserves the efficient inference capabilities characteristic of MoE models without requiring additional pre-training.
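
To make the described routing concrete, below is a minimal PyTorch sketch of an LID-guided two-level MoE layer in the spirit of the abstract, not the authors' implementation. The names (CollaborativeMoELayer, FeedForwardExpert), the dimensions, the use of two language groups with four experts each, and the dense soft mixing of group outputs by the LID weights are illustrative assumptions; the paper may use a different expert-selection scheme (e.g., top-k routing) and different training losses.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardExpert(nn.Module):
    # A standard position-wise feed-forward expert (hypothetical sizes).
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class CollaborativeMoELayer(nn.Module):
    # Sketch: an LID router weights language-specific expert groups, and an
    # unsupervised gate mixes the experts inside each group.
    def __init__(self, d_model=256, d_ff=1024, num_languages=2, experts_per_group=4):
        super().__init__()
        self.lid_router = nn.Linear(d_model, num_languages)  # trained with an auxiliary LID loss
        self.group_gates = nn.ModuleList(
            [nn.Linear(d_model, experts_per_group) for _ in range(num_languages)])  # unsupervised gates
        self.groups = nn.ModuleList(
            [nn.ModuleList([FeedForwardExpert(d_model, d_ff) for _ in range(experts_per_group)])
             for _ in range(num_languages)])

    def forward(self, x):
        # x: (batch, time, d_model) frame-level encoder features
        lid_logits = self.lid_router(x)              # (B, T, num_languages)
        lid_weights = F.softmax(lid_logits, dim=-1)  # routing / inter-group collaboration weights

        group_outputs = []
        for gate, experts in zip(self.group_gates, self.groups):
            gate_w = F.softmax(gate(x), dim=-1)                                # (B, T, E)
            expert_out = torch.stack([e(x) for e in experts], dim=-2)          # (B, T, E, d_model)
            group_outputs.append((gate_w.unsqueeze(-1) * expert_out).sum(-2))  # intra-group mixing

        group_stack = torch.stack(group_outputs, dim=-2)            # (B, T, L, d_model)
        y = (lid_weights.unsqueeze(-1) * group_stack).sum(-2)       # inter-group collaboration
        return y, lid_logits  # lid_logits can feed a supervised LID objective

Usage sketch: y, lid_logits = CollaborativeMoELayer()(torch.randn(8, 100, 256)) returns the mixed representation and frame-level LID logits for a batch of 8 utterances of 100 frames; the logits are what a supervised LID objective, as described in the abstract, would train.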


URL

https://arxiv.org/abs/2409.02050

PDF

https://arxiv.org/pdf/2409.02050.pdf

