Abstract
Speech pre-processing techniques such as denoising, de-reverberation, and separation are commonly employed as front-ends for downstream speech processing tasks. However, these methods can be inadequate, leaving residual noise or introducing new artifacts. Such deficiencies are typically not captured by metrics like SI-SNR but are noticeable to human listeners. To address this, we introduce SpeechRefiner, a post-processing tool that uses Conditional Flow Matching (CFM) to improve the perceptual quality of speech. In this study, we benchmark SpeechRefiner against recent task-specific refinement methods and evaluate its performance within our internal processing pipeline, which integrates multiple front-end algorithms. Experiments show that SpeechRefiner generalizes well across diverse impairment sources and significantly enhances perceptual speech quality. Audio demos can be found at this https URL.
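For readers unfamiliar with the metric mentioned above, the following is a minimal sketch of the standard scale-invariant signal-to-noise ratio (SI-SNR) in plain Python. It is not taken from the paper; the function name `si_snr` and the stabilizer `eps` are illustrative assumptions. It shows why the metric only measures energy outside the reference direction and so may miss artifacts that listeners hear.

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (standard definition; illustrative sketch)."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so that a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to obtain the scaled reference.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps  # eps is an assumed stabilizer
    scale = dot / t_energy
    s_target = [scale * b for b in t]
    # Everything not explained by the scaled reference counts as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(r * r for r in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)

# Usage: a synthetic "clean" tone and two degraded versions of it.
clean = [math.sin(0.1 * i) for i in range(200)]
mild = [c + 0.1 * math.cos(0.37 * i) for i, c in enumerate(clean)]
heavy = [c + 0.5 * math.cos(0.37 * i) for i, c in enumerate(clean)]
print(si_snr(mild, clean), si_snr(heavy, clean))
```

Because the score depends only on the energy ratio between the projected reference and the residual, two outputs with the same residual energy receive the same SI-SNR regardless of how audible the residual is.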
URL
https://arxiv.org/abs/2506.13709