
BUDDy: Single-Channel Blind Unsupervised Dereverberation with Diffusion Models

2024-05-07 12:41:31
Eloi Moliner, Jean-Marie Lemercier, Simon Welker, Timo Gerkmann, Vesa Välimäki

Abstract

In this paper, we present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation, based on posterior sampling with diffusion models. We parameterize the reverberation operator using a filter with exponential decay for each frequency subband, and iteratively estimate the corresponding parameters as the speech utterance is refined along the reverse diffusion trajectory. A measurement consistency criterion enforces the fidelity of the generated speech to the reverberant measurement, while an unconditional diffusion model implements a strong prior for clean speech generation. Without any knowledge of the room impulse response or any coupled reverberant-anechoic data, we can successfully perform dereverberation in various acoustic scenarios. Our method significantly outperforms previous blind unsupervised baselines, and we demonstrate its increased robustness to unseen acoustic conditions in comparison to blind supervised methods. Audio samples and code are available online.
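
The subband parameterization and the measurement consistency criterion described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names, the `weights` and `decay_rates` parameters, and the choice of a real-valued magnitude envelope applied by convolution along time in each subband. The abstract does not specify the exact operator (for example, how phase is handled), so this should not be read as the authors' implementation.

```python
import numpy as np


def subband_decay_envelope(weights, decay_rates, num_frames):
    """Hypothetical envelope: one exponentially decaying filter per subband.

    Returns an array of shape (num_bands, num_frames) whose entry [b, n]
    is weights[b] * exp(-decay_rates[b] * n).
    """
    weights = np.asarray(weights, dtype=float)
    decay_rates = np.asarray(decay_rates, dtype=float)
    frames = np.arange(num_frames)
    return weights[:, None] * np.exp(-decay_rates[:, None] * frames[None, :])


def apply_subband_filter(clean_spec, envelope):
    """Convolve each subband of a spectrogram along time with its envelope.

    The output is truncated to the input length (assumed simplification).
    """
    num_bands, num_frames = clean_spec.shape
    out = np.zeros((num_bands, num_frames), dtype=complex)
    for b in range(num_bands):
        out[b] = np.convolve(clean_spec[b], envelope[b])[:num_frames]
    return out


def measurement_consistency(reverberant_spec, clean_estimate, envelope):
    """Squared error between the measurement and the re-reverberated estimate."""
    residual = reverberant_spec - apply_subband_filter(clean_estimate, envelope)
    return float(np.sum(np.abs(residual) ** 2))
```

In a posterior sampling loop, a cost of this kind would be evaluated at each reverse diffusion step, with both the clean speech estimate and the filter parameters updated to reduce it. The update rule itself is not given in the abstract and is left out of the sketch.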

URL

https://arxiv.org/abs/2405.04272

PDF

https://arxiv.org/pdf/2405.04272.pdf

