WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses

2022-03-21 06:42:44

Zewang Zhang, Yibin Zheng, Xinhui Li, Li Lu

arXiv_CL

Abstract

In this paper, we develop a new multi-singer Chinese neural singing voice synthesis (SVS) system named WeSinger. To improve the accuracy and naturalness of the synthesized singing voice, we design several specific modules and techniques: 1) a deep bi-directional LSTM-based duration model with a multi-scale rhythm loss and a post-processing step; 2) a Transformer-like acoustic model with a progressive pitch-weighted decoder loss; 3) a 24 kHz pitch-aware LPCNet neural vocoder to produce high-quality singing waveforms; 4) a novel data augmentation method with multi-singer pre-training for stronger robustness and naturalness. Both quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness, and WeSinger achieves state-of-the-art performance on the public corpus Opencpop. Some synthesized singing samples are available online.

URL

https://arxiv.org/abs/2203.10750

PDF

https://arxiv.org/pdf/2203.10750.pdf