FlashSpeech: Efficient Zero-Shot Speech Synthesis

2024-04-23 02:57:46
Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue


Recent progress in large-scale zero-shot speech synthesis has been driven largely by language models and diffusion models. However, generation with both approaches is slow and computationally intensive. Achieving quality on par with previous work under a lower computing budget remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch, without a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. FlashSpeech generates speech in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in this https URL.
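
To make the "one or two sampling steps" claim concrete, the following is a minimal sketch of generic consistency-model sampling, not FlashSpeech's actual implementation. Everything named here is an assumption for illustration: `toy_consistency_fn` stands in for the trained network (it is the exact consistency function when the data is a single point `mu`), the noise bounds `SIGMA_MIN` and `SIGMA_MAX` are commonly used defaults rather than values from the abstract, and the intermediate noise level `0.8` is arbitrary. The real system would additionally operate on audio latents and condition on text and the audio prompt.

```python
import math
import random

# Assumed noise range; illustrative values, not taken from the abstract.
SIGMA_MIN = 0.002
SIGMA_MAX = 80.0


def toy_consistency_fn(x, t, mu=1.5):
    """Hypothetical stand-in for a trained consistency network.

    Maps a noisy point x at noise level t to the start of its trajectory.
    This closed form is exact when the data is a point mass at mu, and it
    satisfies the boundary condition f(x, SIGMA_MIN) == x.
    """
    return [mu + (SIGMA_MIN / t) * (xi - mu) for xi in x]


def consistency_sample(f, dim, timesteps=(), rng=None):
    """One-step sampling by default; each entry in timesteps adds a step."""
    if rng is None:
        rng = random.Random(0)
    # Step 1: draw pure noise at the largest noise level and denoise in one call.
    x_T = [rng.gauss(0.0, SIGMA_MAX) for _ in range(dim)]
    x = f(x_T, SIGMA_MAX)
    # Optional refinement: re-noise to a smaller level t, then denoise again.
    for t in timesteps:
        noise_scale = math.sqrt(t * t - SIGMA_MIN ** 2)
        x_t = [xi + noise_scale * rng.gauss(0.0, 1.0) for xi in x]
        x = f(x_t, t)
    return x


one_step = consistency_sample(toy_consistency_fn, dim=4)
two_step = consistency_sample(toy_consistency_fn, dim=4, timesteps=(0.8,))
```

Each sample costs one network call per step, which is where the speed advantage over a diffusion sampler with many denoising iterations would come from.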




