Abstract
Expressive text-to-speech (TTS) aims to synthesize speech in different speaking styles according to users' demands. There are currently two common ways to control speaking style: (1) pre-defining a set of speaking styles and using a categorical index to denote each style. However, this limits the diversity of expressiveness, as such models can only generate the pre-defined styles. (2) Using a reference speech sample as the style input, which has the drawback that the extracted style information is neither intuitive nor interpretable. In this study, we attempt to use natural language as a style prompt to control the style of the synthesized speech, \textit{e.g.}, ``Sigh tone in full of sad mood with some helpless feeling". Since no existing TTS corpus is suitable for benchmarking this novel task, we first construct a speech corpus whose samples are annotated not only with content transcriptions but also with style descriptions in natural language. We then propose an expressive TTS model, named InstructTTS, which is novel in the following aspects: (1) We take full advantage of self-supervised learning and cross-modal metric learning, and propose a novel three-stage training procedure to obtain a robust sentence embedding model that can effectively capture semantic information from the style prompts and control the speaking style of the generated speech. (2) We propose to model acoustic features in a discrete latent space and train a novel discrete diffusion probabilistic model to generate vector-quantized (VQ) acoustic tokens rather than the commonly used mel spectrogram. (3) We jointly apply mutual information (MI) estimation and minimization during acoustic model training to minimize the style-speaker and style-content MI, avoiding possible leakage of content and speaker information from the style prompt.
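The abstract does not specify the exact cross-modal metric learning objective used to align style-prompt embeddings with speech style. A common instantiation of such cross-modal alignment is a symmetric InfoNCE contrastive loss over paired text/speech embeddings; the sketch below is a minimal, generic illustration of that idea (the function names and the NumPy implementation are assumptions for illustration, not the authors' code):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_infonce(text_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE loss between N paired style-prompt and speech-style embeddings.

    Row i of each matrix is assumed to describe the same utterance; matched pairs
    are pulled together and mismatched pairs pushed apart in a shared space.
    """
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    s = l2_normalize(np.asarray(speech_emb, dtype=np.float64))
    logits = t @ s.T / temperature          # (N, N) cosine-similarity matrix
    labels = np.arange(logits.shape[0])     # pair i matches pair i (the diagonal)

    def cross_entropy(lg):
        # numerically stable log-softmax over each row
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of the text-to-speech and speech-to-text retrieval losses
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In a full system, `text_emb` would come from the sentence embedding model trained on style prompts and `speech_emb` from a speech style encoder; minimizing this loss aligns the two modalities so the prompt embedding can steer the acoustic model.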
URL
https://arxiv.org/abs/2301.13662