Abstract
Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into that of an arbitrary unseen speaker while retaining the speech content. Most prior work focuses on preserving the source speech's prosody, yet fine-grained timbre information may leak through prosody, and transferring the target speaker's prosody to the synthesized speech is rarely studied. In light of this, we propose R-VC, a rhythm-controllable and efficient zero-shot voice conversion model. R-VC employs data perturbation techniques and discretizes source speech into HuBERT content tokens, eliminating much content-irrelevant information. By leveraging a Mask Generative Transformer for in-context duration modeling, our model adapts the duration of the linguistic content to the desired target speaking style, facilitating the transfer of the target speaker's rhythm. Furthermore, R-VC introduces a powerful Diffusion Transformer (DiT) trained with shortcut flow matching, conditioning the network not only on the current noise level but also on the desired step size; this enables high-quality speech generation with high timbre similarity in few sampling steps, even as few as two, thus minimizing latency. Experimental results show that R-VC achieves speaker similarity comparable to state-of-the-art VC methods while using a smaller training dataset, and surpasses them in speech naturalness, intelligibility, and style transfer performance.
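To make the shortcut-flow-matching idea concrete, below is a minimal PyTorch sketch of a training objective that conditions a velocity network on both the noise level t and the step size d, plus the few-step Euler sampler this enables. It follows the general shortcut-model recipe (a flow-matching branch at d = 0 and a self-consistency branch tying one step of size 2d to two steps of size d); the function names, tensor shapes, and the d = 1/128 step grid are illustrative assumptions, not R-VC's released implementation.

```python
import torch
import torch.nn.functional as F

def shortcut_fm_loss(model, x1, cond, d=1.0 / 128):
    """One training step. model(x_t, t, d, cond) -> predicted velocity;
    x1: clean target features, e.g. mels of shape (B, T, n_mels)."""
    b = x1.size(0)
    x0 = torch.randn_like(x1)                          # Gaussian noise endpoint
    t = torch.rand(b, device=x1.device) * (1 - 2 * d)  # keep t + 2d <= 1
    tb = t.view(b, 1, 1)
    xt = (1 - tb) * x0 + tb * x1                       # linear interpolant

    # (1) Flow-matching branch: with step size 0 the network should
    # predict the instantaneous (straight-line) velocity x1 - x0.
    v_fm = model(xt, t, torch.zeros_like(t), cond)
    loss_fm = F.mse_loss(v_fm, x1 - x0)

    # (2) Self-consistency branch: one step of size 2d must match the
    # average of two consecutive steps of size d (targets detached).
    dt = torch.full_like(t, d)
    with torch.no_grad():
        v1 = model(xt, t, dt, cond)
        xt_next = xt + d * v1                          # Euler step of size d
        v2 = model(xt_next, t + d, dt, cond)
        v_target = 0.5 * (v1 + v2)
    v_2d = model(xt, t, 2 * dt, cond)
    loss_sc = F.mse_loss(v_2d, v_target)

    return loss_fm + loss_sc

@torch.no_grad()
def sample(model, cond, shape, steps=2, device="cpu"):
    """Few-step Euler sampling: because the model is told the actual
    step size, even steps=2 can stay on the learned trajectory."""
    x = torch.randn(shape, device=device)
    d = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * d, device=device)
        x = x + d * model(x, t, torch.full_like(t, d), cond)
    return x
```

The design point is that a plain flow-matching model is only trained for infinitesimal steps, so large Euler steps drift off the trajectory; conditioning on d lets one network be queried at coarse step sizes directly, which is what allows the two-step inference claimed in the abstract.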
URL
https://arxiv.org/abs/2506.01014