MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Abstract

Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, relying primarily on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, where MM-TTS achieves 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
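
For readers unfamiliar with the objective metrics quoted above, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between a reference transcript and a recognized transcript, divided by the reference length. It is not taken from the MM-TTS code base. The function names (edit_distance, wer, cer), the example sentences, and the choice to strip spaces before computing CER are all assumptions made for illustration; the paper's evaluation pipeline (ASR model, text normalization) may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (hypothetical helper)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate; spaces are removed first (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    reference = "the cat sat on the mat"   # made-up ground-truth text
    hypothesis = "the cat sit on mat"      # made-up ASR output
    print(f"WER = {wer(reference, hypothesis):.4f}")  # 2 edits / 6 words
    print(f"CER = {cer(reference, hypothesis):.4f}")  # 4 edits / 17 chars
```

In a TTS evaluation, the hypothesis would typically come from running a speech recognizer on the synthesized audio, so lower WER and CER indicate more intelligible speech.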

URL

https://arxiv.org/abs/2404.18398

PDF

https://arxiv.org/pdf/2404.18398.pdf
