Abstract
Here we present a novel approach to conditioning the SampleRNN generative model for voice conversion (VC). Conventional methods for VC modify the perceived speaker identity by converting between source and target acoustic features. Our approach focuses on preserving voice content and depends on the generative network to learn voice style. We first train a multi-speaker SampleRNN model conditioned on linguistic features, pitch contour, and speaker identity using a multi-speaker speech corpus. Voice-converted speech is generated using linguistic features and pitch contour extracted from the source speaker, and the target speaker identity. We demonstrate that our system is capable of many-to-many voice conversion without requiring parallel data, enabling broad applications. Subjective evaluation demonstrates that our approach outperforms conventional VC methods.
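The pipeline the abstract describes — condition a generative model on per-frame linguistic features and pitch from the *source* utterance plus the *target* speaker identity — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions and the one-hot speaker encoding are assumptions for the example.

```python
import numpy as np

def build_conditioning(linguistic, f0, speaker_id, num_speakers):
    """Stack per-frame linguistic features, the pitch contour, and a
    one-hot speaker code into one frame-level conditioning matrix.
    (Hypothetical layout; the paper's exact conditioning is not shown here.)"""
    frames = linguistic.shape[0]
    speaker_onehot = np.zeros((frames, num_speakers))
    speaker_onehot[:, speaker_id] = 1.0
    # Each row conditions the generative model at one frame.
    return np.concatenate([linguistic, f0[:, None], speaker_onehot], axis=1)

# Voice conversion: linguistic features and F0 come from the SOURCE
# utterance, while the speaker code selects the TARGET voice.
src_linguistic = np.random.rand(100, 40)   # assumed 40-dim linguistic features
src_f0 = np.random.rand(100)               # source pitch contour, one value/frame
cond = build_conditioning(src_linguistic, src_f0, speaker_id=3, num_speakers=8)
print(cond.shape)  # (100, 49)
```

Because the speaker code is just one input among the conditioning features, swapping it at generation time converts any source speaker to any target speaker, which is what makes the approach many-to-many without parallel data.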
URL
https://arxiv.org/abs/1808.08311