Deep neural network based speech separation optimizing an objective estimator of intelligibility for low latency applications

2018-07-18 12:55:59
Gaurav Naithani, Joonas Nikunen, Lars Bramsløw, Tuomas Virtanen

Abstract

Mean square error (MSE) has been the preferred choice of loss function in current deep neural network (DNN) based speech separation techniques. In this paper, we propose a new cost function that aims to optimize the extended short-time objective intelligibility (ESTOI) measure. We focus on applications where low algorithmic latency ($\leq 10$ ms) is important. We use long short-term memory (LSTM) networks and evaluate the proposed approach on four sets of two-speaker mixtures from the extended Danish hearing in noise (HINT) dataset. We show that the proposed loss function offers improved or on-par objective intelligibility (in terms of ESTOI) compared to an MSE-optimized baseline, while resulting in lower objective separation performance (in terms of the source-to-distortion ratio (SDR)). We then propose an approach in which the network is first initialized with weights optimized for the MSE criterion and then trained with the proposed ESTOI loss criterion. This approach mitigates some of the losses in objective separation performance while preserving the gains in objective intelligibility.
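
The difference between an MSE criterion and a correlation-style intelligibility criterion can be illustrated with a toy sketch. The code below is not the ESTOI loss proposed in the paper: it is a simplified stand-in, assuming only that an intelligibility-oriented criterion rewards correlation between normalized segments of the estimate and the target. The function names (mse_loss, correlation_loss) and the sample values are hypothetical and chosen purely for illustration.

```python
import math

def mse_loss(estimate, target):
    """Mean square error between two equal-length sequences."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

def _normalize(x):
    """Subtract the mean and scale to unit Euclidean norm."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0.0:
        return [0.0] * len(x)  # a constant segment carries no correlation
    return [v / norm for v in centered]

def correlation_loss(estimate, target):
    """Negative correlation: -1 is a perfect match, 0 is uncorrelated.

    Simplified stand-in (assumption), not the full ESTOI computation.
    """
    e, t = _normalize(estimate), _normalize(target)
    return -sum(a * b for a, b in zip(e, t))

# Hypothetical envelope segment and two candidate estimates of it.
target = [0.1, 0.4, 0.9, 0.3, 0.2, 0.7]
scaled = [0.5 * v for v in target]          # same shape, half the level
reordered = [0.9, 0.1, 0.3, 0.7, 0.4, 0.2]  # same values, different shape

for name, est in (("scaled", scaled), ("reordered", reordered)):
    print(name, "mse:", mse_loss(est, target),
          "correlation loss:", correlation_loss(est, target))
```

In this sketch the scaled estimate is penalized by MSE but is a perfect match under the correlation criterion, because mean and norm normalization remove level differences. This is one simple way the two criteria can rank the same output differently, which may help build intuition for why a network trained on one need not score best on the other.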

URL

https://arxiv.org/abs/1807.06899

PDF

https://arxiv.org/pdf/1807.06899.pdf

