Abstract
Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve upon a self-supervised, non-autoregressive framework built on a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce the embeddings with an augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, we incorporate local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experiments show that our model outperforms the baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style leakage.
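The abstract names two conditioning mechanisms: mix-style layer normalization on the content pathway and cross-attention over local F0. Below is a minimal PyTorch sketch of how such modules are commonly built, assuming mix-style layer normalization interpolates style embeddings across the batch (in the spirit of MixStyle) and that F0 conditioning is standard multi-head cross-attention over an embedded pitch contour; module names, shapes, and hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of two components named in the abstract (PyTorch).
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class MixStyleLayerNorm(nn.Module):
    """Layer norm whose scale/shift are predicted from a style embedding;
    during training, style embeddings are randomly mixed across the batch
    so the content pathway cannot rely on one utterance's style statistics."""

    def __init__(self, hidden_dim: int, style_dim: int, p: float = 0.5, alpha: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.scale = nn.Linear(style_dim, hidden_dim)
        self.shift = nn.Linear(style_dim, hidden_dim)
        self.p, self.alpha = p, alpha

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), style: (batch, style_dim)
        if self.training and torch.rand(1).item() < self.p:
            lam = torch.distributions.Beta(self.alpha, self.alpha).sample((x.size(0), 1)).to(x.device)
            perm = torch.randperm(x.size(0), device=x.device)
            style = lam * style + (1.0 - lam) * style[perm]   # mix styles across the batch
        gamma = self.scale(style).unsqueeze(1)                 # (batch, 1, hidden_dim)
        beta = self.shift(style).unsqueeze(1)
        return gamma * self.norm(x) + beta


class F0CrossAttention(nn.Module):
    """Injects frame-level F0 into content features: content frames attend
    over an embedded (quantized) F0 contour via multi-head cross-attention."""

    def __init__(self, hidden_dim: int, n_heads: int = 4, n_f0_bins: int = 256):
        super().__init__()
        self.f0_embed = nn.Embedding(n_f0_bins, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)

    def forward(self, content: torch.Tensor, f0_bins: torch.LongTensor) -> torch.Tensor:
        # content: (batch, T_content, hidden_dim), f0_bins: (batch, T_f0) quantized F0
        f0 = self.f0_embed(f0_bins)                        # (batch, T_f0, hidden_dim)
        out, _ = self.attn(query=content, key=f0, value=f0)
        return content + out                               # residual conditioning
```

The batch-level style mixing is what discourages style leakage: the decoder sees normalization statistics that do not consistently match the source utterance, so expressive cues must come from the explicit style and F0 inputs instead.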
Abstract (translated)
Expressive voice conversion aims to transfer the speaker identity and emotional characteristics from a target speech to a given source speech. In this work, we improve a self-supervised, non-autoregressive framework with a conditional variational autoencoder (CVAE), focusing on reducing source timbre leakage and improving the disentanglement of linguistic and acoustic features for better style transfer. To minimize style leakage, we use multilingual discrete speech units to represent content and reinforce the embeddings with an augmentation-based similarity loss and mix-style layer normalization. To improve the transfer of expressivity, we introduce local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experimental results show that our model outperforms the baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style leakage.
URL
https://arxiv.org/abs/2506.04013