Paper Reading AI Learner

Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

2025-02-11 04:18:33
Xueyao Zhang, Xiaohui Zhang, Kainan Peng, Zhenyu Tang, Vimal Manohar, Yingru Liu, Jeff Hwang, Dangna Li, Yuhao Wang, Julian Chan, Yuan Huang, Zhizheng Wu, Mingbo Ma

Abstract

Voice imitation that targets specific speech attributes, such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to effectively disentangle timbre and style, which makes controllable generation challenging, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: given text or the content tokens of speech as input, an autoregressive transformer, prompted by a style reference, generates content-style tokens; (2) Acoustic Modeling: given the content-style tokens as input, a flow-matching transformer, prompted by a timbre reference, produces acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt a VQ-VAE as the tokenizer for the continuous hidden features of HuBERT, treat the vocabulary size of the VQ-VAE codebook as an information bottleneck, and adjust it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at this https URL.
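
The two-stage flow described in the abstract can be summarized as the following minimal sketch. Every class, method, and argument name here (VevoPipeline, imitate, and the tokenizer/model fields) is a hypothetical stand-in for illustration; the paper does not define this API.

```python
# A minimal sketch of the two-stage inference flow described in the abstract.
# All names below are hypothetical stand-ins, not the authors' released code.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class VevoPipeline:
    content_tokenizer: Callable[[Any], Any]        # small-codebook VQ over HuBERT features
    content_style_tokenizer: Callable[[Any], Any]  # larger-codebook VQ over HuBERT features
    ar_model: Callable[[Any, Any], Any]            # stage 1: autoregressive transformer
    flow_model: Callable[[Any, Any], Any]          # stage 2: flow-matching transformer
    vocoder: Callable[[Any], Any]                  # acoustic representation -> waveform

    def imitate(self, source_speech: Any, style_reference: Any, timbre_reference: Any) -> Any:
        # Stage 1 (content-style modeling): the autoregressive transformer,
        # prompted by the style reference, maps content tokens to content-style tokens.
        content_tokens = self.content_tokenizer(source_speech)
        style_prompt = self.content_style_tokenizer(style_reference)
        content_style_tokens = self.ar_model(style_prompt, content_tokens)
        # Stage 2 (acoustic modeling): the flow-matching transformer, prompted by
        # the timbre reference, produces acoustic features from content-style tokens.
        acoustic_features = self.flow_model(content_style_tokens, timbre_reference)
        # A vocoder turns the acoustic representation back into a waveform.
        return self.vocoder(acoustic_features)
```

Swapping the style and timbre references independently is what gives the framework its separate control over speaking style and voice identity.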

Abstract (translated)

Voice imitation, particularly of specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to effectively disentangle timbre from style, which makes controllable generation challenging, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages:

1. **Content-style modeling**: given text or the content tokens of speech as input, an autoregressive transformer, prompted by a style reference, generates content-style tokens.
2. **Acoustic modeling**: given the content-style tokens as input, a flow-matching transformer, prompted by a timbre reference, produces acoustic representations.

To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we use a VQ-VAE as the tokenizer for the continuous hidden features of HuBERT and treat the vocabulary size of the VQ-VAE codebook as an information bottleneck, adjusting it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60,000 hours of audiobook speech data, without fine-tuning on any style-specific corpus, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Its effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at the link given in the original abstract.
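
The key disentanglement idea above is that the VQ codebook size acts as an information bottleneck on HuBERT's continuous features. Below is a minimal, hedged PyTorch sketch of that tokenization step; the codebook sizes (32 and 4096) and the feature dimension are illustrative assumptions rather than values taken from the paper, and a real VQ-VAE codebook would be learned during training rather than randomly initialized.

```python
# Sketch (not the authors' code): vector-quantizing frame-level SSL features,
# where the codebook size serves as an information bottleneck. A small codebook
# is assumed to keep mostly linguistic content, while a larger one also retains
# speaking style; the sizes below are illustrative only.
import torch
import torch.nn as nn


class VQTokenizer(nn.Module):
    def __init__(self, feat_dim: int, codebook_size: int):
        super().__init__()
        # One embedding vector per discrete token (randomly initialized here;
        # in a real VQ-VAE this codebook is learned).
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    @torch.no_grad()
    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (frames, feat_dim) continuous hidden features, e.g. from HuBERT.
        # Assign each frame to its nearest codebook entry (Euclidean distance).
        dists = torch.cdist(feats, self.codebook.weight)  # (frames, codebook_size)
        return dists.argmin(dim=-1)                       # (frames,) token ids


# Hypothetical usage: 1024-dim features, two bottleneck widths.
feats = torch.randn(200, 1024)                          # stand-in for HuBERT hidden states
content_tokens = VQTokenizer(1024, 32)(feats)           # tight bottleneck -> content
content_style_tokens = VQTokenizer(1024, 4096)(feats)   # looser bottleneck -> content + style
```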

URL

https://arxiv.org/abs/2502.07243

PDF

https://arxiv.org/pdf/2502.07243.pdf

