Abstract
We work to create a multilingual speech synthesis system that can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging because bilingual training data is expensive to obtain in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, leading to poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker, and fine-grained $F_0$ and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.
URL
https://arxiv.org/abs/2301.10335