Abstract
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown whether performance on speech tasks carries over to non-speech tasks. To study this question, we develop a universal dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore using either a short-time Fourier transform (STFT) or a learnable basis, as used in ConvTasNet, and for both of these bases, we examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
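The improvement figures above are reported in scale-invariant signal-to-distortion ratio (SI-SDR). As a rough illustration only (not the paper's evaluation code), a common formulation of SI-SDR projects the estimate onto the reference to find the best scaling, then measures the power ratio of that scaled target to the residual; the function name and the zero-mean normalization step here are assumptions, since definitions in the literature vary on mean removal:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (illustrative sketch, not the paper's code)."""
    # Zero-mean both signals (a common, but not universal, convention)
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference toward the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```

Because the metric is scale-invariant, rescaling the estimate (e.g. doubling its amplitude) leaves the score essentially unchanged, which is why it is preferred over plain SDR for mask-based separation.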
URL
https://arxiv.org/abs/1905.03330