Paper Reading AI Learner

SE-Bridge: Speech Enhancement with Consistent Brownian Bridge

2023-05-23 08:06:36
Zhibin Qiu, Mengfan Fu, Fuchun Sun, Gulila Altenbek, Hao Huang

Abstract

We propose SE-Bridge, a novel method for speech enhancement (SE). After recently applying the diffusion models to speech enhancement, we can achieve speech enhancement by solving a stochastic differential equation (SDE). Each SDE corresponds to a probabilistic flow ordinary differential equation (PF-ODE), and the trajectory of the PF-ODE solution consists of the speech states at different moments. Our approach is based on consistency model that ensure any speech states on the same PF-ODE trajectory, correspond to the same initial state. By integrating the Brownian Bridge process, the model is able to generate high-intelligibility speech samples without adversarial training. This is the first attempt that applies the consistency models to SE task, achieving state-of-the-art results in several metrics while saving 15 x the time required for sampling compared to the diffusion-based baseline. Our experiments on multiple datasets demonstrate the effectiveness of SE-Bridge in SE. Furthermore, we show through extensive experiments on downstream tasks, including Automatic Speech Recognition (ASR) and Speaker Verification (SV), that SE-Bridge can effectively support multiple downstream tasks.

Abstract (translated)

我们提出SE-Bridge,一种用于语音增强的新颖方法(SE)。最近,我们应用扩散模型对语音增强进行了尝试,我们可以通过解决一宗随机微分方程(SDE)来实现语音增强。每个SDE对应着一宗概率流普通微分方程(PF-ODE),PF-ODE解的轨迹包含不同时刻的语音状态。我们的方法是基于一致性模型的,该模型确保在同一PF-ODE解轨迹上的任何语音状态都对应着相同的初始状态。通过整合布朗运动桥过程,模型能够生成高清晰度语音样本而无需对抗训练。这是第一个尝试将一致性模型应用于SE任务,在多个指标上实现了最先进的结果,而与扩散基线相比,采样所需的时间节省了15倍。我们对各种数据集的实验表明,SE-Bridge在SE任务中非常有效。此外,我们通过对后续任务,包括自动语音识别(ASR)和语音识别(SV)等广泛的实验,证明了SE-Bridge能够有效支持多个后续任务。

URL

https://arxiv.org/abs/2305.13796

PDF

https://arxiv.org/pdf/2305.13796.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot