Abstract
Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while adding minimal computational overhead.
URL
https://arxiv.org/abs/2404.08022