Abstract
The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge in speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques that train model-free, end-to-end mappings using millions of tunable parameters. The shift towards machine learning models has nonetheless posed the reverse problem: a compelling need to discover knowledge and to explain, visualise and interpret. Our work bridges a comprehensive generative model of intonation and state-of-the-art AI techniques. We build upon the modelling paradigm of the Superposition of Functional Contours (SFC) model and propose a Variational Prosody Model (VPM) that uses a network of deep variational contour generators to capture the context-sensitive variation of the constituent elementary prosodic clichés. We show that the VPM can give insight into the intrinsic variability of these prosodic prototypes by learning a meaningfully structured prosodic latent space. We also show that the VPM brings improved modelling performance, especially when such variability is prominent. In a speech synthesis scenario, we believe the model can be used to generate a dynamic and natural prosody contour largely devoid of averaging effects.
URL
https://arxiv.org/abs/1806.08685