Paper Reading AI Learner

Personalized Lightweight Text-to-Speech: Voice Cloning with Adaptive Structured Pruning

2023-03-21 12:59:46
Sung-Feng Huang, Chia-ping Chen, Zhi-Sheng Chen, Yu-Pao Tsai, Hung-yi Lee

Abstract

Personalized TTS is an exciting and highly desired application that allows users to train their TTS voice using only a few recordings. However, TTS training typically requires many hours of recording and a large model, making it unsuitable for deployment on mobile devices. To overcome this limitation, related works typically require fine-tuning a pre-trained TTS model to preserve its ability to generate high-quality audio samples while adapting to the target speaker's voice. This process is commonly referred to as ``voice cloning.'' Although related works have achieved significant success in changing the TTS model's voice, they are still required to fine-tune from a large pre-trained model, resulting in a significant size for the voice-cloned model. In this paper, we propose applying trainable structured pruning to voice cloning. By training the structured pruning masks with voice-cloning data, we can produce a unique pruned model for each target speaker. Our experiments demonstrate that using learnable structured pruning, we can compress the model size to 7 times smaller while achieving comparable voice-cloning performance.

Abstract (translated)

个性化TTS是一个令人兴奋且高度渴望的应用,它允许用户使用少数录制进行TTS语音训练。然而,TTS训练通常需要大量录制和大型模型,因此不适合在移动设备上部署。要克服这个限制,相关工作通常需要微调预先训练的TTS模型,以保留其生成高质量音频样本的能力,同时适应目标说话人的声音。这一过程通常被称为“语音克隆”。尽管相关工作已经成功地改变了TTS模型的声音,但它们仍然需要从大型预先训练模型进行微调,导致语音克隆模型的大小很大。在本文中,我们提议将可训练的结构压缩应用于语音克隆。通过使用语音克隆数据训练结构压缩 masks,我们可以为每个目标说话人生产一个独特的压缩模型。我们的实验表明,使用可训练的结构压缩,我们可以将模型大小压缩到7倍 smaller,同时实现类似的语音克隆性能。

URL

https://arxiv.org/abs/2303.11816

PDF

https://arxiv.org/pdf/2303.11816.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot