Paper Reading AI Learner

Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

2025-02-11 04:18:33
Xueyao Zhang, Xiaohui Zhang, Kainan Peng, Zhenyu Tang, Vimal Manohar, Yingru Liu, Jeff Hwang, Dangna Li, Yuhao Wang, Julian Chan, Yuan Huang, Zhizheng Wu, Mingbo Ma

Abstract

Voice imitation that targets specific speech attributes, such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to effectively disentangle timbre and style, which makes controllable generation challenging, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: given text or the content tokens of speech as input, an autoregressive transformer, prompted by a style reference, generates content-style tokens; (2) Acoustic Modeling: given the content-style tokens as input, a flow-matching transformer, prompted by a timbre reference, produces acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt a VQ-VAE as the tokenizer for the continuous hidden features of HuBERT, treat the vocabulary size of the VQ-VAE codebook as an information bottleneck, and adjust it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at this https URL.
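
The two-stage flow described in the abstract can be summarized as the following minimal sketch. Every class, method, and argument name here (VevoPipeline, imitate, and the tokenizer/model fields) is a hypothetical stand-in for illustration; the paper does not define this API.

```python
# A minimal sketch of the two-stage inference flow described in the abstract.
# All names below are hypothetical stand-ins, not the authors' released code.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class VevoPipeline:
    content_tokenizer: Callable[[Any], Any]        # small-codebook VQ over HuBERT features
    content_style_tokenizer: Callable[[Any], Any]  # larger-codebook VQ over HuBERT features
    ar_model: Callable[[Any, Any], Any]            # stage 1: autoregressive transformer
    flow_model: Callable[[Any, Any], Any]          # stage 2: flow-matching transformer
    vocoder: Callable[[Any], Any]                  # acoustic representation -> waveform

    def imitate(self, source_speech: Any, style_reference: Any, timbre_reference: Any) -> Any:
        # Stage 1 (content-style modeling): the autoregressive transformer,
        # prompted by the style reference, maps content tokens to content-style tokens.
        content_tokens = self.content_tokenizer(source_speech)
        style_prompt = self.content_style_tokenizer(style_reference)
        content_style_tokens = self.ar_model(style_prompt, content_tokens)
        # Stage 2 (acoustic modeling): the flow-matching transformer, prompted by
        # the timbre reference, produces acoustic features from content-style tokens.
        acoustic_features = self.flow_model(content_style_tokens, timbre_reference)
        # A vocoder turns the acoustic representation back into a waveform.
        return self.vocoder(acoustic_features)
```

Swapping the style and timbre references independently is what gives the framework its separate control over speaking style and voice identity.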

Abstract (translated)

Voice imitation, particularly of specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to effectively disentangle timbre from style, which makes controllable generation challenging, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages:

1. **Content-style modeling**: given text or the content tokens of speech as input, an autoregressive transformer, prompted by a style reference, generates content-style tokens.
2. **Acoustic modeling**: given the content-style tokens as input, a flow-matching transformer, prompted by a timbre reference, produces acoustic representations.

To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we use a VQ-VAE as the tokenizer for the continuous hidden features of HuBERT and treat the vocabulary size of the VQ-VAE codebook as an information bottleneck, adjusting it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60,000 hours of audiobook speech data, without fine-tuning on any style-specific corpus, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Its effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at the link given in the original abstract.
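
The key disentanglement idea above is that the VQ codebook size acts as an information bottleneck on HuBERT's continuous features. Below is a minimal, hedged PyTorch sketch of that tokenization step; the codebook sizes (32 and 4096) and the feature dimension are illustrative assumptions rather than values taken from the paper, and a real VQ-VAE codebook would be learned during training rather than randomly initialized.

```python
# Sketch (not the authors' code): vector-quantizing frame-level SSL features,
# where the codebook size serves as an information bottleneck. A small codebook
# is assumed to keep mostly linguistic content, while a larger one also retains
# speaking style; the sizes below are illustrative only.
import torch
import torch.nn as nn


class VQTokenizer(nn.Module):
    def __init__(self, feat_dim: int, codebook_size: int):
        super().__init__()
        # One embedding vector per discrete token (randomly initialized here;
        # in a real VQ-VAE this codebook is learned).
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    @torch.no_grad()
    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (frames, feat_dim) continuous hidden features, e.g. from HuBERT.
        # Assign each frame to its nearest codebook entry (Euclidean distance).
        dists = torch.cdist(feats, self.codebook.weight)  # (frames, codebook_size)
        return dists.argmin(dim=-1)                       # (frames,) token ids


# Hypothetical usage: 1024-dim features, two bottleneck widths.
feats = torch.randn(200, 1024)                          # stand-in for HuBERT hidden states
content_tokens = VQTokenizer(1024, 32)(feats)           # tight bottleneck -> content
content_style_tokens = VQTokenizer(1024, 4096)(feats)   # looser bottleneck -> content + style
```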

URL

https://arxiv.org/abs/2502.07243

PDF

https://arxiv.org/pdf/2502.07243.pdf

