A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2

2024-04-11 08:09:57
Thomas Serre (S2A, IDS), Mathieu Fontaine (S2A, IDS), Éric Benhaim, Geoffroy Dutour, Slim Essid (S2A, IDS)

Abstract

Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while preserving minimal computational overhead.
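
The abstract does not say how the speaker embeddings are injected into the network. As a rough illustration only, the sketch below shows one common conditioning choice: appending a fixed speaker embedding to the feature vector of every time frame. The function name `condition_on_speaker`, the toy dimensions, and the use of concatenation are all assumptions made for this example; none of them is taken from the paper or from the DeepFilterNet2 code base.

```python
from typing import List


def condition_on_speaker(
    frames: List[List[float]], speaker_embedding: List[float]
) -> List[List[float]]:
    """Append the same speaker embedding to every time frame's features.

    Hypothetical helper: concatenation is an assumed conditioning scheme,
    not the integration method described in the paper.
    """
    # List concatenation builds a new list per frame, so the input is not mutated.
    return [frame + speaker_embedding for frame in frames]


# Toy input: 3 time frames with 4 features each, and a 2-dimensional embedding.
frames = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
speaker_embedding = [0.25, -0.75]

conditioned = condition_on_speaker(frames, speaker_embedding)
print(len(conditioned), len(conditioned[0]))  # 3 6
```

In a dual-stage model, the "position" of the integration would correspond to which intermediate feature sequence is passed in as `frames`; the layer that consumes the result then needs an input size enlarged by the embedding dimension.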

URL

https://arxiv.org/abs/2404.08022

PDF

https://arxiv.org/pdf/2404.08022.pdf