Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

Abstract

Expressive voice conversion (VC) jointly converts speaker identity and emotional style for emotional speakers. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored, and modeling emotional prosody remains a major challenge. Previous approaches have also relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on vocoder performance. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We use speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Code and samples are publicly available.
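To make the idea of a conditional DDPM concrete, the following is a minimal sketch of the standard DDPM forward (noising) process and noise-prediction loss, with a conditioning vector assembled from content, emotion, and speaker features. It is not the paper's implementation: the linear beta schedule, the plain concatenation in `build_condition`, the toy feature sizes, and the `zero_denoiser` placeholder are all assumptions made for illustration, and a real system would use a trained neural network operating on waveforms.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

def build_condition(content, emotion, speaker):
    """Hypothetical conditioning: concatenate the three feature vectors."""
    return list(content) + list(emotion) + list(speaker)

def noise_prediction_loss(denoiser, x0, cond, t, alpha_bars, rng):
    """Mean squared error between the true noise and the predicted noise."""
    x_t, eps = q_sample(x0, t, alpha_bars, rng)
    eps_hat = denoiser(x_t, t, cond)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)

def zero_denoiser(x_t, t, cond):
    """Placeholder for a trained network; always predicts zero noise."""
    return [0.0] * len(x_t)

# Toy usage with made-up feature sizes and values.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(100))
cond = build_condition(content=[0.1, 0.2], emotion=[0.3], speaker=[0.4, 0.5, 0.6])
x0 = [0.0, 0.5, -0.5, 0.25]  # stand-in for a short speech segment
loss = noise_prediction_loss(zero_denoiser, x0, cond, t=50, alpha_bars=alpha_bars, rng=rng)
```

In this sketch the denoiser receives the noisy signal, the timestep, and the conditioning vector, which is the sense in which the diffusion model is "conditional"; how the actual framework injects each condition is not specified in the abstract.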

URL

https://arxiv.org/abs/2405.01730

PDF

https://arxiv.org/pdf/2405.01730.pdf
