Abstract
As a foundational technology for intelligent human-computer interaction, voice conversion (VC) seeks to transform speech from any source timbre into any target timbre. Traditional voice conversion methods based on Generative Adversarial Networks (GANs) encounter significant challenges in precisely encoding diverse speech elements and effectively synthesizing these elements into natural-sounding converted speech. To overcome these limitations, we introduce Pureformer-VC, an encoder-decoder framework that utilizes Conformer blocks to build a disentangled encoder and employs Zipformer blocks to create a style transfer decoder. We adopt a variational decoupled training approach to isolate speech components using a Variational Autoencoder (VAE), complemented by triplet discriminative training to enhance speaker discriminability. Furthermore, we incorporate the Attention Style Transfer Mechanism (ASTM) with Zipformer's shared weights to improve the style transfer performance of the decoder. We conducted experiments on two multi-speaker datasets. The experimental results demonstrate that the proposed model achieves subjective evaluation scores comparable to existing approaches while significantly improving objective metrics in many-to-many and many-to-one VC scenarios.
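The abstract describes the training objective only at a high level. As a rough illustration, the sketch below shows how such an objective might be composed in PyTorch: a VAE-style reconstruction and KL term for disentanglement plus a triplet loss on speaker embeddings for discriminability. All function names, loss weights, and the choice of an L1 reconstruction loss are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): composing a VAE-based
# disentanglement objective with a triplet loss on speaker embeddings,
# as described in the abstract. Weights and loss choices are assumed.
import torch
import torch.nn.functional as F

def vae_kl_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Standard KL divergence between q(z|x) = N(mu, sigma^2) and N(0, I).
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def total_loss(recon, target, mu, logvar, anchor_emb, pos_emb, neg_emb,
               kl_weight=0.01, triplet_weight=1.0, margin=0.3):
    # Reconstruction of the converted acoustic features (assumed L1 here).
    recon_loss = F.l1_loss(recon, target)
    # Variational term encouraging a disentangled latent space.
    kl_loss = vae_kl_loss(mu, logvar)
    # Triplet loss pulling same-speaker embeddings together and pushing
    # different-speaker embeddings apart.
    triplet_loss = F.triplet_margin_loss(anchor_emb, pos_emb, neg_emb,
                                         margin=margin)
    return recon_loss + kl_weight * kl_loss + triplet_weight * triplet_loss
```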
Abstract (translated)
As a foundational technology for intelligent human-computer interaction, voice conversion (VC) aims to convert speech of any source timbre into a target timbre. Traditional voice conversion methods based on Generative Adversarial Networks (GANs) face significant challenges in precisely encoding diverse speech elements and effectively synthesizing natural, fluent converted speech. To overcome these limitations, we propose an encoder-decoder framework named Pureformer-VC, which uses Conformer blocks to build a disentangled encoder and Zipformer blocks to build a style transfer decoder. During training we adopt a variational decoupling approach, using a Variational Autoencoder (VAE) to isolate speech components, and apply triplet discriminative training to strengthen speaker discriminability. In addition, we introduce the Attention Style Transfer Mechanism (ASTM) into Zipformer's shared weights to improve the style transfer performance of the decoder. We conduct experiments on two multi-speaker datasets. The results show that, in many-to-many and many-to-one voice conversion scenarios, the proposed model achieves subjective evaluation scores comparable to existing methods while significantly improving objective metrics.
URL
https://arxiv.org/abs/2506.08348