Paper Reading AI Learner

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

2024-05-02 20:51:53
Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman

Abstract

Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Codes and samples are publicly available.

Abstract (translated)

表达性语音转换(VC)通过共同转换说话人身份和情感风格来对情感说话人进行演讲者身份转换。对于表达性VC中任意说话人的情感风格建模,尚未进行过深入探讨。之前的解决方案依赖于语音重建器的语音重建,这使得语音质量高度依赖于语音重建器的表现。情感语音转换的一个主要挑战是情感语调建模。为了应对这些挑战,本文基于条件去噪扩散概率模型(DDPM)提出了一种完整的端到端情感语音转换框架。我们利用自监督语音模型产生的语音单元作为内容条件,同时从语音情感识别和说话人验证系统中提取深度特征来建模情感风格和说话人身份。客观和主观评估结果表明,我们的框架的有效性。代码和样本公开可用。

URL

https://arxiv.org/abs/2405.01730

PDF

https://arxiv.org/pdf/2405.01730.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot