Perceptually Guided End-to-End Text-to-Speech

2020-11-02 18:13:48

Yeunju Choi, Youngmoon Jung, Youngjoo Suh, Hoirin Kim

arXiv_SD

Abstract

Several fast text-to-speech (TTS) models have been proposed for real-time processing, but their speech quality still leaves room for improvement. In addition, the loss function used for training does not match the mean opinion score (MOS) used for evaluation, and this mismatch may limit the speech quality of TTS models. In this work, we propose a method that improves the speech quality of a fast TTS model while maintaining its inference speed. To do so, we train the TTS model with a perceptual loss based on the predicted MOS. Under the supervision of a MOS prediction model, the TTS model can learn to increase the perceptual quality of speech directly. In our experiments, we train FastSpeech on our internal Korean dataset using a MOS prediction model pre-trained on the Voice Conversion Challenge 2018 evaluation results. The MOS test results show that the proposed approach outperforms the baseline FastSpeech in speech quality.
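
The idea of adding a MOS-based perceptual term to the training objective can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything here is an assumption made for illustration: the form of the perceptual term (the gap between the top of a 1-5 MOS scale and the predicted MOS), the weighting factor `weight`, the L1 reconstruction term, and the toy predictor `toy_mos_predictor`, which merely stands in for a frozen, pre-trained MOS prediction model.

```python
from typing import Callable, Sequence

MAX_MOS = 5.0  # assumed top of a 1-5 MOS scale


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and target features."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def perceptual_loss(pred: Sequence[float],
                    mos_predictor: Callable[[Sequence[float]], float]) -> float:
    """Assumed form: distance of the predicted MOS from the best score."""
    return MAX_MOS - mos_predictor(pred)


def total_loss(pred: Sequence[float],
               target: Sequence[float],
               mos_predictor: Callable[[Sequence[float]], float],
               weight: float = 1.0) -> float:
    """Reconstruction loss plus a weighted perceptual term (hypothetical)."""
    return l1_loss(pred, target) + weight * perceptual_loss(pred, mos_predictor)


def toy_mos_predictor(frames: Sequence[float]) -> float:
    """Hypothetical stand-in for a frozen MOS model.

    It scores smoother sequences higher and clips the result to [1, 5].
    A real predictor would be a pre-trained neural network.
    """
    steps = max(len(frames) - 1, 1)
    roughness = sum(abs(a - b) for a, b in zip(frames[1:], frames[:-1])) / steps
    return max(1.0, min(MAX_MOS, MAX_MOS - roughness))
```

In an actual training setup, the MOS predictor's parameters would stay fixed and only the TTS model would be updated, so the extra term adds cost at training time but not at inference time, which is consistent with the claim that inference speed is maintained.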

URL

https://arxiv.org/abs/2011.01174

PDF

https://arxiv.org/pdf/2011.01174.pdf