Abstract
There is increasing interest in using the LEArnable Front-end (LEAF) in a variety of speech processing systems. However, there is a dearth of analyses of what is actually learnt and of the relative importance of training the different components of the front-end. In this paper, we investigate this question on keyword spotting, speech-based emotion recognition, and language identification tasks. We find that the filters for spectral decomposition and the low-pass filter used to estimate spectral energy variations exhibit no learning, and that the per-channel energy normalisation (PCEN) is the key component that is learnt. Following this, we explore the potential of adapting only the PCEN layer with a small amount of noisy data, enabling it to learn dynamic range compression that better suits the noise conditions. This in turn enables a system trained on clean speech to work more accurately on noisy test data, as demonstrated by the experimental results reported in this paper.
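For readers unfamiliar with PCEN, the sketch below illustrates the commonly used formulation: a first-order low-pass smoother estimates the slowly varying energy in each channel, the input is divided by that estimate (adaptive gain control), and the result is root-compressed. This is not code from the paper. The function name `pcen_channel` and the default parameter values are assumptions chosen for illustration; in a learnable front-end such as LEAF, the smoothing coefficient `s`, gain exponent `alpha`, offset `delta`, and compression exponent `r` would be trainable per-channel parameters.

```python
def pcen_channel(energy, s=0.04, alpha=0.96, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to the energy sequence of a single filterbank channel.

    The defaults are illustrative values only (an assumption of this
    sketch), not settings reported in the paper.
    """
    smoothed = energy[0]  # initialise the smoother with the first frame
    out = []
    for e in energy:
        # First-order IIR low-pass filter tracking the slowly varying energy.
        smoothed = (1.0 - s) * smoothed + s * e
        # Adaptive gain control followed by root compression.
        out.append((e / (eps + smoothed) ** alpha + delta) ** r - delta ** r)
    return out


# Two stationary signals that differ in level by a factor of 100.
quiet = pcen_channel([1.0] * 50, alpha=1.0)
loud = pcen_channel([100.0] * 50, alpha=1.0)
```

With `alpha=1.0` the gain control fully normalises stationary energy, so `quiet` and `loud` end up at nearly the same output value. Adapting `alpha`, `delta`, and `r` changes how strongly such level differences, including those caused by background noise, are compressed.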
URL
https://arxiv.org/abs/2404.06702