A Variational Prosody Model for the decomposition and synthesis of speech prosody

2018-06-22 14:14:30
Branislav Gerazov, Gérard Bailly, Omar Mohammed, Yi Xu, Philip N. Garner

Abstract

The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-to-end mappings using millions of tunable parameters. The shift towards machine learning models has nonetheless posed the reverse problem: a compelling need to discover knowledge, and to explain, visualise and interpret. Our work bridges a comprehensive generative model of intonation and state-of-the-art AI techniques. We build upon the modelling paradigm of the Superposition of Functional Contours model and propose a Variational Prosody Model (VPM) that uses a network of deep variational contour generators to capture the context-sensitive variation of the constituent elementary prosodic clichés. We show that the VPM can give insight into the intrinsic variability of these prosodic prototypes through learning a meaningful prosodic latent space representation structure. We also show that the VPM brings improved modelling performance, especially when such variability is prominent. In a speech synthesis scenario, we believe the model can be used to generate a dynamic and natural prosody contour largely devoid of averaging effects.

URL

https://arxiv.org/abs/1806.08685

PDF

https://arxiv.org/pdf/1806.08685.pdf

