Paper Reading AI Learner

Universal Sound Separation

2019-05-08 20:48:49
Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, John R. Hershey

Abstract

Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown whether performance on speech tasks carries over to non-speech tasks. To study this question, we develop a universal dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore using either a short-time Fourier transform (STFT) or a learnable basis, as used in ConvTasNet, and for both of these bases, we examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
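The abstract reports results in scale-invariant signal-to-distortion ratio (SI-SDR), the standard evaluation metric for this line of separation work. As a rough illustration of what that number measures, here is a minimal sketch of the usual SI-SDR definition: the estimate is projected onto the reference to find the optimal scaling, and the ratio of the scaled-target energy to the residual energy is expressed in dB. This is a generic sketch of the metric, not code from the paper; the function name and NumPy-based implementation are illustrative.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB.

    estimate, reference: 1-D arrays of equal length (e.g. waveforms).
    """
    # Optimal scaling of the reference via projection of the estimate onto it.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled reference ("true" component)
    noise = estimate - target           # everything not explained by the reference
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))
```

Because of the projection step, rescaling the estimate by any constant leaves the score unchanged, which is what makes the metric "scale-invariant" and why it is preferred over plain SDR for mask-based separation systems.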


URL

https://arxiv.org/abs/1905.03330

PDF

https://arxiv.org/pdf/1905.03330.pdf

