Abstract
Speech enhancement, particularly denoising, is vital in improving the intelligibility and quality of speech signals for real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models Wave-U-Net, CMGAN, and U-Net, on diverse datasets such as SpEAR, VPQAD, and Clarkson datasets. These models were chosen due to their relevance in the literature and code accessibility. The evaluation reveals that U-Net achieves high noise suppression with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN outperforms in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well-suited for applications prioritizing natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research indicates how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.
Abstract (translated)
语音增强,尤其是降噪技术,在改善真实世界应用场景中语音信号的可懂度和质量方面至关重要,尤其是在噪音环境中。尽管此前的研究已经提出了各种用于此目的的深度学习模型,但许多模型在噪声抑制、感知质量以及说话人特定特征保留之间难以取得平衡,留下了比较性能评估中的一个重要研究缺口。本研究对Wave-U-Net、CMGAN和U-Net这三种最先进的模型,在SpEAR、VPQAD和Clarkson数据集等多样化数据集上进行了基准测试。这些模型因其在文献中的相关性和代码可获取性而被选中进行研究。 评价结果表明,U-Net在噪声抑制方面表现出色,在SpEAR数据集上的信噪比(SNR)提高了71.96%,VPQAD数据集上提高了64.83%,Clarkson数据集上则提高了364.2%。CMGAN模型在感知质量方面表现优异,分别在SpEAR和VPQAD数据集中获得了最高的PESQ评分4.04和1.46,使其非常适合需要自然且易于理解的语音的应用场景。Wave-U-Net模型在保留说话人特定特征的同时也实现了噪声抑制方面的改进,这体现在VeriSpeak评分上的提升:在SpEAR数据集上提高了10.84%,VPQAD数据集上则提升了27.38%。 这项研究揭示了先进方法如何优化噪声抑制、感知质量和说话人识别之间的权衡。该研究的发现可能会推进语音生物识别技术、法医音频分析、电信通讯和在复杂声学条件下的说话人验证等领域的发展。
URL
https://arxiv.org/abs/2506.15000