Abstract
Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
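The abstract states that EP-Align uses contrastive learning to align emotional features across modalities, but does not give the objective. A common choice for this kind of cross-modal alignment is a CLIP-style symmetric InfoNCE loss over paired embeddings; the sketch below is a hypothetical illustration of that idea (function names, dimensions, and the temperature value are assumptions, not taken from the paper).

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired, L2-normalized
    embeddings from two modalities (e.g. text and audio) that describe
    the same emotion. Matching pairs share a row index.
    This is an illustrative sketch, not the paper's actual EP-Align loss.
    """
    # Cosine-similarity matrix between every anchor/positive pair.
    logits = anchor @ positive.T / temperature
    labels = np.arange(len(anchor))  # matching pairs sit on the diagonal

    def xent(l):
        # Numerically stable cross-entropy pulling the diagonal up.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average both directions: anchor->positive and positive->anchor.
    return 0.5 * (xent(logits) + xent(logits.T))

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy check: perfectly aligned modalities incur a lower loss than
# unrelated random embeddings.
rng = np.random.default_rng(0)
text_emb = normalize(rng.normal(size=(4, 8)))
audio_emb_aligned = text_emb.copy()
audio_emb_random = normalize(rng.normal(size=(4, 8)))

loss_aligned = info_nce(text_emb, audio_emb_aligned)
loss_random = info_nce(text_emb, audio_emb_random)
assert loss_aligned < loss_random
```

Minimizing such a loss pulls embeddings of the same emotional content together across modalities while pushing mismatched pairs apart, which is consistent with the "coherent fusion of multimodal information" the abstract attributes to EP-Align.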
URL
https://arxiv.org/abs/2404.18398