Abstract
We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC accepts both text and audio prompts, enabling more flexible voice style conversion. It models a latent distribution conditioned on speaker embeddings obtained from a pre-trained speaker encoder and, in parallel, optimises style text embeddings to align with the speaker style information through contrastive learning. As a result, HybridVC can be trained efficiently under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multi-modal voice style conversion, underscoring its potential for widespread applications such as user-defined personalised voices on social media platforms. A comprehensive ablation study further validates the effectiveness of our method.
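The contrastive alignment between style text embeddings and speaker embeddings can be illustrated with a minimal sketch. The abstract does not specify the loss, so the symmetric InfoNCE (CLIP-style) objective below is an assumption, as are the function names, the temperature value of 0.07, and the use of cosine similarity; it is not the authors' implementation.

```python
import math

def l2_normalise(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(text_embs, speaker_embs, temperature=0.07):
    # Pairwise cosine similarities divided by a temperature (hypothetical value).
    t = [l2_normalise(v) for v in text_embs]
    s = [l2_normalise(v) for v in speaker_embs]
    return [[sum(a * b for a, b in zip(ti, sj)) / temperature for sj in s] for ti in t]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th text prompt is paired with the i-th speaker embedding.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(text_embs, speaker_embs, temperature=0.07):
    # Average of the text-to-speaker and speaker-to-text losses.
    logits = cosine_logits(text_embs, speaker_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy batch: two style text embeddings and their paired speaker embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(text, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(text, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each text embedding points in the same direction as its paired speaker embedding, and large when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward.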
URL
https://arxiv.org/abs/2404.15637