Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech

2021-10-14 07:37:04

Haoyue Zhan, Xinyuan Yu, Haitong Zhang, Yang Zhang, Yue Lin

arXiv_SD

arXiv_SD Pose Speech

Abstract

In this paper, we present a FastPitch-based non-autoregressive cross-lingual Text-to-Speech (TTS) model built with language-independent input representations and monolingual forced aligners. We propose a phoneme length regulator that solves the length mismatch between language-independent phonemes and monolingual alignment results. Our experiments show that (1) increasing the number of training speakers encourages a non-autoregressive cross-lingual TTS model to disentangle speaker and language representations, and (2) the variance adaptors of the FastPitch model help disentangle speaker identity from the learned representations in cross-lingual TTS. Subjective evaluation shows that the proposed model achieves decent speaker consistency and similarity. We further improve the naturalness of Mandarin-dominated mixed-lingual utterances by exploiting the controllability of the proposed model.
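
The abstract does not say how the phoneme length regulator works, so the following is only an illustrative sketch of the kind of length mismatch it mentions, not the authors' method. It assumes that each phoneme from a monolingual aligner maps to a known number of language-independent symbols, and that the aligned frame count can be split near-evenly across those symbols. The names `split_duration`, `regulate_durations`, and `expand` are hypothetical.

```python
from typing import List, Sequence


def split_duration(frames: int, parts: int) -> List[int]:
    """Split `frames` into `parts` near-equal integers that sum to `frames`."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(frames, parts)
    # The first `extra` parts receive one additional frame.
    return [base + 1 if i < extra else base for i in range(parts)]


def regulate_durations(
    aligned_durations: Sequence[int],
    symbols_per_phoneme: Sequence[int],
) -> List[int]:
    """Map per-phoneme aligner durations onto a longer symbol sequence.

    Assumption (not from the paper): symbols_per_phoneme[i] is the number of
    language-independent symbols that aligner phoneme i expands to.
    """
    if len(aligned_durations) != len(symbols_per_phoneme):
        raise ValueError("inputs must have the same length")
    regulated: List[int] = []
    for frames, parts in zip(aligned_durations, symbols_per_phoneme):
        regulated.extend(split_duration(frames, parts))
    return regulated


def expand(symbols: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Standard length-regulator step: repeat each symbol for its duration."""
    if len(symbols) != len(durations):
        raise ValueError("inputs must have the same length")
    return [s for s, d in zip(symbols, durations) for _ in range(d)]


# Two aligner phonemes (5 and 3 frames); the first maps to two symbols.
durations = regulate_durations([5, 3], [2, 1])
frames = expand(["a", "b", "c"], durations)
```

Under these assumptions the total frame count is preserved, so the expanded symbol sequence still lines up with the acoustic frames produced by the monolingual aligner.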

URL

https://arxiv.org/abs/2110.07192

PDF

https://arxiv.org/pdf/2110.07192.pdf