Abstract
Personalized TTS is an exciting and highly desired application that allows users to train their TTS voice using only a few recordings. However, TTS training typically requires many hours of recordings and a large model, making it unsuitable for deployment on mobile devices. To overcome this limitation, related work typically fine-tunes a pre-trained TTS model to preserve its ability to generate high-quality audio samples while adapting to the target speaker's voice. This process is commonly referred to as ``voice cloning.'' Although related work has achieved significant success in changing the TTS model's voice, it still must fine-tune from a large pre-trained model, so the voice-cloned model remains large. In this paper, we propose applying trainable structured pruning to voice cloning. By training the structured pruning masks with voice-cloning data, we can produce a unique pruned model for each target speaker. Our experiments demonstrate that with learnable structured pruning, we can compress the model to 7 times smaller than the original while achieving comparable voice-cloning performance.
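As a rough sketch of the core mechanic (not the paper's actual implementation; the sigmoid mask parameterization and the 0.5 threshold are illustrative assumptions), each prunable group, e.g., an output channel of a weight matrix, gets a learnable logit. A sigmoid of the logit acts as a soft mask during adaptation, and channels whose mask value falls below the threshold are physically removed at export time, shrinking the model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# A toy linear layer: 8 output channels, 4 inputs.
W = rng.standard_normal((8, 4))

# One learnable logit per output channel. In the paper's setup these
# would be trained jointly with the voice-cloning loss plus a sparsity
# penalty; here they are fixed values to illustrate the mechanics.
mask_logits = np.array([3.0, -4.0, 2.5, -5.0, 4.0, -3.5, 1.5, -6.0])

# Soft mask used during adaptation: scales each channel's weights.
soft_mask = sigmoid(mask_logits)
W_masked = soft_mask[:, None] * W

# At export time, channels below the threshold are removed, yielding
# a physically smaller matrix (structured, not element-wise, pruning).
keep = soft_mask > 0.5
W_pruned = W[keep]

print(W.shape, "->", W_pruned.shape)  # (8, 4) -> (4, 4)
```

Because entire channels are dropped rather than individual weights, the pruned matrix stays dense and runs efficiently on mobile hardware without sparse kernels.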
URL
https://arxiv.org/abs/2303.11816