
FlashSpeech: Efficient Zero-Shot Speech Synthesis

2024-04-23 02:57:46
Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue

Abstract

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis that uses a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. FlashSpeech generates speech efficiently in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in this https URL.
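
The two efficiency figures in the abstract ("approximately 5% of the inference time" and "about 20 times faster") describe the same ratio from two directions. The short Python sketch below only illustrates that arithmetic. The function names are hypothetical and are not taken from the paper, and the inputs are the rounded figures quoted above, not measured timings.

```python
def speedup_from_time_fraction(fraction: float) -> float:
    """Convert 'takes this fraction of the baseline time' into a speedup factor.

    Hypothetical helper for illustration; not part of FlashSpeech.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


def time_fraction_from_speedup(speedup: float) -> float:
    """Convert 'this many times faster' into a fraction of the baseline time."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 / speedup


if __name__ == "__main__":
    # Rounded figures quoted in the abstract, used here as plain inputs.
    print(f"5% of the time is about {speedup_from_time_fraction(0.05):.1f}x faster")
    print(f"20x faster is about {time_fraction_from_speedup(20.0):.0%} of the time")
```

Under these rounded inputs the two statements agree: a system that needs one twentieth of the baseline inference time is twenty times faster than the baseline.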

Abstract (translated)

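In recent years, language models and diffusion models have significantly advanced large-scale zero-shot speech synthesis. However, generation with both methods is slow and computationally intensive. Achieving the quality of previous work with a lower computing budget remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system whose inference time is approximately 5% of that of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch, with no pre-trained diffusion model needed as the teacher. In addition, a new prosody generator module increases the diversity of prosody, making the rhythm of the speech more natural. FlashSpeech can generate speech efficiently in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable voice quality and similarity. FlashSpeech also demonstrates its versatility by efficiently performing tasks such as voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at the link given in the abstract.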

URL

https://arxiv.org/abs/2404.14700

PDF

https://arxiv.org/pdf/2404.14700.pdf

