Abstract
The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is desirable to have a way of modeling the variation in the prosodic aspects of speech, so that audio signals can be synthesized in multiple ways for a given text. We present a new, hierarchically structured conditional variational autoencoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation. To efficiently capture the hierarchical nature of the linguistic input (words, syllables and phones), both the encoder and decoder parts of the autoencoder are hierarchical, in line with the linguistic structure, with layers being clocked dynamically at the respective rates. We show in our experiments that our dynamic hierarchical network outperforms a non-hierarchical state-of-the-art baseline and, additionally, that prosody transfer across sentences is possible by employing the prosody embedding of one sentence to generate the speech signal of another.
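The two mechanisms described in the abstract, sampling a sentence-level prosody embedding and running layers at word, syllable and phone rates, can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a standard normal prior for the embedding and assumes that moving from a coarser to a finer linguistic level can be approximated by repeating each coarse state once per finer unit it contains. The function names and the example sentence structure are hypothetical.

```python
import random


def sample_prosody_embedding(dim, rng, scale=1.0):
    """Draw a sentence-level embedding z ~ N(0, scale^2 I).

    Assumption: the prior is a standard normal; the abstract only says the
    embedding is sampled from the variational layer.
    """
    return [rng.gauss(0.0, scale) for _ in range(dim)]


def upsample(values, counts):
    """Repeat each coarse-level item once per finer-level unit it contains.

    Assumption: a stand-in for layers "clocked" at different linguistic rates.
    """
    if len(values) != len(counts):
        raise ValueError("need exactly one count per coarse-level item")
    out = []
    for value, n in zip(values, counts):
        out.extend([value] * n)
    return out


# Hypothetical sentence: 2 words, 3 syllables, 7 phones.
syllables_per_word = [2, 1]
phones_per_syllable = [2, 3, 2]

rng = random.Random(0)
z = sample_prosody_embedding(4, rng)  # one draw = one prosodic rendition

word_states = ["w0", "w1"]  # placeholders for word-rate layer outputs
syllable_rate = upsample(word_states, syllables_per_word)
phone_rate = upsample(syllable_rate, phones_per_syllable)
```

Drawing a different `z` for the same text corresponds to requesting a different prosodic rendition, while reusing a `z` obtained for one sentence when decoding another mirrors the prosody transfer idea mentioned in the abstract.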
URL
https://arxiv.org/abs/1905.07195