Abstract
In this paper, we propose NSD-MS2S, a novel neural speaker diarization system that integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence (Seq2Seq) architecture. The system uses a memory module to enhance speaker embeddings and a Seq2Seq framework to efficiently map acoustic features to speaker labels. We also explore the application of mixture of experts to speaker diarization and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and improve performance; incorporating SS-MoE yields the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including the CHiME-6, DiPCo, Mixer 6, and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showing their effectiveness in challenging real-world scenarios.
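The abstract does not spell out how the SS-MoE module is built, so the following is only a minimal sketch of the general idea, assuming it combines a generic soft mixture-of-experts layer (every token contributes to every expert slot through softmax weights) with one shared expert that processes all tokens. All names here (`ss_moe_layer`, `make_expert`, `phi`), the one-slot-per-expert layout, and the additive combination of shared and routed outputs are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def make_expert(dim, hidden, rng):
    """Hypothetical expert: a two-layer feed-forward network with ReLU."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]

    def expert(rows):
        h = [[max(0.0, v) for v in row] for row in matmul(rows, w1)]
        return matmul(h, w2)

    return expert


def ss_moe_layer(tokens, phi, experts, shared_expert):
    """Assumed shared + soft MoE layer.

    tokens: T x D input frames; phi: D x S slot parameters,
    with one slot per expert (S == len(experts)).
    """
    logits = matmul(tokens, phi)  # T x S
    # Dispatch weights: softmax over tokens, separately for each slot.
    dispatch = transpose([softmax(col) for col in transpose(logits)])
    # Combine weights: softmax over slots, separately for each token.
    combine = [softmax(row) for row in logits]
    # Each slot is a weighted average of all tokens (no hard routing).
    slots = matmul(transpose(dispatch), tokens)  # S x D
    slot_out = [experts[s]([slots[s]])[0] for s in range(len(experts))]
    routed = matmul(combine, slot_out)  # T x D
    # Shared expert sees every token; its output is added to the routed one.
    shared = shared_expert(tokens)
    output = [[r + s for r, s in zip(rr, sr)] for rr, sr in zip(routed, shared)]
    return output, dispatch, combine


# Toy usage: 5 frames of dimension 4, 3 routed experts plus 1 shared expert.
rng = random.Random(0)
T, D, S = 5, 4, 3
tokens = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
phi = [[rng.gauss(0, 1) for _ in range(S)] for _ in range(D)]
experts = [make_expert(D, 8, rng) for _ in range(S)]
shared_expert = make_expert(D, 8, rng)
output, dispatch, combine = ss_moe_layer(tokens, phi, experts, shared_expert)
```

Because the dispatch and combine weights are dense softmaxes rather than top-k selections, every expert receives a gradient from every frame, which is one plausible reading of how a soft mixture could reduce routing bias; the shared expert gives all frames a common transformation regardless of the weights.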
URL
https://arxiv.org/abs/2506.14750