Paper Reading AI Learner

Convolutional Neural Networks to Enhance Coded Speech

2018-06-25 12:20:55
Ziyue Zhao, Huijun Liu, Tim Fingscheidt

Abstract

Enhancing coded speech that suffers from far-end acoustic background noise, quantization noise, and potentially transmission errors is a challenging task. In this work we propose two postprocessing approaches that apply convolutional neural networks (CNNs) either in the time domain or in the cepstral domain to enhance coded speech without any modification of the codecs. The time-domain approach operates end-to-end, while the cepstral-domain approach uses analysis-synthesis with cepstral-domain features. The proposed postprocessors in both domains are evaluated for various narrowband and wideband speech codecs over a wide range of conditions. The proposed postprocessor improves speech quality (PESQ) by up to 0.25 MOS-LQO points for G.711, 0.30 points for G.726, 0.82 points for G.722, and 0.26 points for the adaptive multirate wideband codec (AMR-WB). In a subjective CCR listening test, the proposed postprocessor on G.711-coded speech exceeds the speech quality of an ITU-T-standardized postfilter by 0.36 CMOS points and is clearly preferred over G.711 by 1.77 CMOS points, even on par with uncoded speech. The source code for the cepstral-domain approach to enhance G.711-coded speech is made available.
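To make the time-domain idea concrete, here is a minimal illustrative sketch (not the authors' trained model): a decoded speech frame is passed through a small 1-D convolutional stack with a residual connection, so the postprocessor refines rather than replaces the coded waveform. All layer sizes and weights below are hypothetical placeholders, not the architecture from the paper.

```python
# Illustrative time-domain CNN postprocessor sketch.
# Assumptions (not from the paper): 2 layers, 8 channels, kernel size 5,
# random untrained weights, and a residual connection to the input frame.
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def conv1d_same(x, kernels, bias):
    """'Same'-padded 1-D convolution.

    x: waveform frame of shape (T,); kernels: (C, K) with odd K;
    bias: (C,). Returns feature maps of shape (C, T).
    """
    K = kernels.shape[1]
    xp = np.pad(x, K // 2)
    return np.stack([np.correlate(xp, k, mode="valid") for k in kernels]) + bias[:, None]


def cnn_postprocessor(coded_frame, rng=np.random.default_rng(0)):
    """Map a coded speech frame to an 'enhanced' frame of the same length.

    Hypothetical 2-layer stack: feature conv + ReLU, then a 1x1 projection
    back to one waveform channel, added to the input as a residual.
    """
    C, K = 8, 5
    w1 = rng.normal(scale=0.1, size=(C, K))
    b1 = np.zeros(C)
    w2 = rng.normal(scale=0.1, size=(1, C))  # 1x1 conv over channels
    h = relu(conv1d_same(coded_frame, w1, b1))  # (C, T)
    residual = (w2 @ h)[0]                      # (T,)
    return coded_frame + residual
```

In a real system the weights would be trained to minimize a distance between the postprocessed output and the original uncoded speech; the cepstral-domain variant would instead enhance cepstral features and resynthesize the waveform.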


URL

https://arxiv.org/abs/1806.09411

PDF

https://arxiv.org/pdf/1806.09411.pdf

