Abstract
Data-driven speech enhancement and dereverberation methods differ between the single- and multi-channel cases chiefly in that both the problem formulation and the complexity of the solutions are considerably more demanding in the multi-channel case. Moreover, with limited computational resources, it is cumbersome to train models that require larger datasets or more complex designs. In this scenario, the unverified hypothesis that single-channel methods can be adapted to multi-channel scenarios simply by processing each channel independently has significant implications: it would improve compatibility between sound scene capture and system input-output formats, and would allow research to focus on other challenging aspects, such as full-bandwidth audio enhancement, competitive noise suppression, and unsupervised learning. This study tests this hypothesis by comparing the enhancement achieved by a basic single-channel speech enhancement and dereverberation model with that of two multi-channel models tailored to separate clean speech from noisy 3D mixes. A direction of arrival estimation model was used to objectively evaluate each model's capacity to preserve spatial information, by comparing estimates obtained from the output signals with ground-truth coordinate values. The results show a trade-off: the more straightforward single-channel solution preserves spatial information, at the cost of lower gains in intelligibility scores.
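The per-channel adaptation named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `enhance_per_channel` and `moving_average` are hypothetical names, and the moving average merely stands in for a trained single-channel model. The only point shown is that one mono function is applied to every channel without any cross-channel information.

```python
from typing import Callable, List, Sequence

Channel = List[float]


def enhance_per_channel(
    mix: Sequence[Sequence[float]],
    enhance_mono: Callable[[Channel], Channel],
) -> List[Channel]:
    """Run the same single-channel enhancer on every channel independently."""
    return [enhance_mono(list(channel)) for channel in mix]


def moving_average(signal: Channel, width: int = 3) -> Channel:
    """Hypothetical stand-in for a trained model: causal moving average.

    Samples before the start of the signal are treated as zero, so the
    filter is linear and time-invariant.
    """
    out = []
    for n in range(len(signal)):
        window = signal[max(0, n - width + 1): n + 1]
        out.append(sum(window) / width)
    return out


# Toy two-channel mix: channel 1 is channel 0 delayed by one sample
# and scaled by 0.5 (an assumed example, not data from the paper).
mix = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
]
enhanced = enhance_per_channel(mix, moving_average)
```

With this linear, time-invariant stand-in, the one-sample delay and the 0.5 gain between the two channels survive processing exactly. A neural enhancer is not linear, so no such guarantee holds, which is why the preservation of spatial cues has to be measured, as the study does with a direction of arrival estimator.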
Abstract (translated)
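A key difference between data-driven single-channel and multi-channel speech enhancement and dereverberation methods is that the problem formulation and the complexity of the solutions are considerably greater in the multi-channel case. In addition, with limited computational resources, training models that require managing larger datasets or more complex designs is laborious. In this setting, the unverified hypothesis that single-channel methods can be adapted to multi-channel scenarios simply by processing each channel independently has a major bearing on the compatibility between sound scene capture and system input-output formats, and it also allows modern research to concentrate on other challenging aspects, such as full-bandwidth audio enhancement, competitive noise suppression, and unsupervised learning. This study tests the hypothesis by comparing the enhancement obtained with a basic single-channel speech enhancement and dereverberation model against two multi-channel models designed specifically to separate clean speech from noisy 3D mixes. A direction of arrival estimation model is used to objectively evaluate the capacity to preserve spatial information by comparing the output signals with ground-truth coordinate values. Consequently, there is a trade-off: the simpler single-channel solution preserves spatial information but obtains lower gains in intelligibility scores.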
URL
https://arxiv.org/abs/2404.14564