Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance

Abstract

It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
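
The decomposition and the OA post-processing described above can be illustrated with a small numerical sketch. This is not the authors' implementation. It assumes a simplified setting with one speech source and one noise source, so there is no interference term, and it projects without the time-shifted filters that BSS-eval-style decompositions normally allow. The function names (`project`, `decompose`, `sar_db`, `observation_adding`), the synthetic signals, and the interpolation form `(1 - weight) * est + weight * observed` are all illustrative assumptions.

```python
import numpy as np


def project(x, basis):
    """Orthogonally project x onto the span of the columns of basis."""
    coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return basis @ coeffs


def decompose(est, speech, noise):
    """Split an enhanced signal into target, noise-error, and artifact-error parts.

    Assumption: single speech source and single noise source, no time shifts.
    """
    s_target = project(est, speech[:, None])
    in_span = project(est, np.stack([speech, noise], axis=1))
    e_noise = in_span - s_target   # part explained by the noise reference
    e_artif = est - in_span        # part orthogonal to both speech and noise
    return s_target, e_noise, e_artif


def sar_db(est, speech, noise):
    """Signal-to-artifact ratio in dB (BSS-eval-style definition, assumed here)."""
    s_target, e_noise, e_artif = decompose(est, speech, noise)
    return 10 * np.log10(np.sum((s_target + e_noise) ** 2) / np.sum(e_artif ** 2))


def observation_adding(est, observed, weight):
    """Interpolate the enhanced and observed signals (hypothetical OA form)."""
    return (1.0 - weight) * est + weight * observed


# Synthetic stand-ins for real audio: white-noise "speech", "noise", and "artifact".
rng = np.random.default_rng(0)
n_samples = 16000
speech = rng.standard_normal(n_samples)
noise = 0.5 * rng.standard_normal(n_samples)
observed = speech + noise
artifact = 0.3 * rng.standard_normal(n_samples)
enhanced = 0.9 * speech + 0.2 * noise + artifact

weights = [0.0, 0.25, 0.5, 0.75]
sars = [sar_db(observation_adding(enhanced, observed, w), speech, noise) for w in weights]
for w, s in zip(weights, sars):
    print(f"OA weight {w:.2f}: SAR = {s:.2f} dB")
```

Because the observed signal lies entirely in the span of the speech and noise references, mixing it back in shrinks only the artifact component, which is why the SAR rises with the weight in this toy setting. A weight of 1 would remove the artifact term entirely and make the ratio undefined, so the sketch stops short of it.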

Abstract (translated)

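Using a single-channel speech enhancement (SE) front-end to improve automatic speech recognition (ASR) performance in noisy conditions is a challenging task. This difficulty is usually attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the ASR performance degradation caused by these distortions has not been comprehensively investigated. How to design a single-channel SE front-end that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of ASR performance degradation. To this end, we propose an analysis scheme based on an orthogonal projection decomposition of SE errors. The scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, allowing us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals that artifact errors are more damaging to ASR performance than the other error types. We then study two approaches for reducing the impact of artifact errors. First, we prove that simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) monotonically improves the signal-to-artifact ratio. Second, we propose a new training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate enhanced signals with fewer artifact errors. Through experiments, we confirm that both OA and AB-SDR effectively reduce the artifact errors caused by single-channel SE front-ends, thereby significantly improving ASR performance.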

URL

https://arxiv.org/abs/2404.14860

PDF

https://arxiv.org/pdf/2404.14860.pdf
