Abstract
Recent developments in voice cloning and talking-head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. However, current methods typically require large-scale datasets and computationally intensive training on clean, studio-recorded inputs, which is infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline built around Tortoise text-to-speech, a transformer-based latent diffusion model that performs high-fidelity zero-shot voice cloning from only a few reference samples, combined with a lightweight generative adversarial network architecture for robust, real-time lip synchronization. The solution contributes to several essential goals: reducing reliance on massive pre-training, generating emotionally expressive speech, and maintaining lip synchronization in noisy, unconstrained scenarios. The pipeline's modular structure allows easy extension toward multi-modal and text-guided voice modulation and makes it suitable for real-world systems.
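The modular, swappable structure the abstract describes could be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' code: all class and method names (`VoiceCloner`, `LipSyncer`, `AvatarPipeline`, the dummy stand-ins) are hypothetical, and the real system would plug in Tortoise TTS and a GAN lip-sync model in place of the dummies.

```python
# Hypothetical sketch of the modular pipeline described in the abstract.
# Names are illustrative; they do not correspond to the paper's actual code.
from dataclasses import dataclass


@dataclass
class Audio:
    samples: list          # raw waveform samples
    sample_rate: int       # e.g. 16000 Hz


@dataclass
class Video:
    frames: int            # number of video frames
    fps: float             # frame rate


class AvatarPipeline:
    """Chains a zero-shot voice-cloning TTS stage (e.g. Tortoise) with a
    lightweight GAN lip-sync stage. Either stage can be swapped out,
    which is what makes the design extensible to future modalities."""

    def __init__(self, cloner, syncer):
        self.cloner = cloner   # object with clone(text, reference_clips) -> Audio
        self.syncer = syncer   # object with sync(face_video, speech) -> Video

    def run(self, text, reference_clips, face_video):
        speech = self.cloner.clone(text, reference_clips)
        return self.syncer.sync(face_video, speech)


# Dummy stand-ins so the pipeline wiring can be exercised without models.
class DummyCloner:
    def clone(self, text, reference_clips):
        return Audio(samples=[0.0] * len(text), sample_rate=16000)


class DummySyncer:
    def sync(self, face_video, speech):
        n = int(len(speech.samples) / speech.sample_rate * face_video.fps)
        return Video(frames=n, fps=face_video.fps)
```

In this arrangement, replacing `DummyCloner` with a Tortoise-backed implementation or `DummySyncer` with a different lip-sync GAN requires no change to the pipeline itself, which is the extension point the abstract emphasizes.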
URL
https://arxiv.org/abs/2509.12831