Abstract
Personalized TTS is an exciting and highly desired application that allows users to train their TTS voice using only a few recordings. However, TTS training typically requires many hours of recordings and a large model, making it unsuitable for deployment on mobile devices. To overcome this limitation, related work typically fine-tunes a pre-trained TTS model to preserve its ability to generate high-quality audio samples while adapting to the target speaker's voice. This process is commonly referred to as ``voice cloning.'' Although related work has achieved significant success in changing the TTS model's voice, it still must fine-tune from a large pre-trained model, so the voice-cloned model remains large. In this paper, we propose applying trainable structured pruning to voice cloning. By training the structured pruning masks with voice-cloning data, we can produce a unique pruned model for each target speaker. Our experiments demonstrate that with learnable structured pruning, we can compress the model to 7 times smaller than the original while achieving comparable voice-cloning performance.
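As a rough sketch of the core mechanic (not the paper's actual implementation; the sigmoid mask parameterization and the 0.5 threshold are illustrative assumptions), each prunable group, e.g., an output channel of a weight matrix, gets a learnable logit. A sigmoid of the logit acts as a soft mask during adaptation, and channels whose mask value falls below the threshold are physically removed at export time, shrinking the model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# A toy linear layer: 8 output channels, 4 inputs.
W = rng.standard_normal((8, 4))

# One learnable logit per output channel. In the paper's setup these
# would be trained jointly with the voice-cloning loss plus a sparsity
# penalty; here they are fixed values to illustrate the mechanics.
mask_logits = np.array([3.0, -4.0, 2.5, -5.0, 4.0, -3.5, 1.5, -6.0])

# Soft mask used during adaptation: scales each channel's weights.
soft_mask = sigmoid(mask_logits)
W_masked = soft_mask[:, None] * W

# At export time, channels below the threshold are removed, yielding
# a physically smaller matrix (structured, not element-wise, pruning).
keep = soft_mask > 0.5
W_pruned = W[keep]

print(W.shape, "->", W_pruned.shape)  # (8, 4) -> (4, 4)
```

Because entire channels are dropped rather than individual weights, the pruned matrix stays dense and runs efficiently on mobile hardware without sparse kernels.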
URL
https://arxiv.org/abs/2303.11816