Abstract
Generative models are a popular choice for adult-to-adult voice conversion (VC) because they model unlabelled data efficiently. Their usefulness for producing children's speech, and in particular for adult-to-child VC, has not yet been investigated. For adult-to-child VC, four generative models are compared: a diffusion model, a flow-based model, a variational autoencoder, and a generative adversarial network. Results show that although the converted speech produced by these models sounds plausible, it is insufficiently similar to the characteristics of the target speaker. We introduce an efficient frequency warping technique that can be applied to the output of the models and that significantly reduces the mismatch between adult and child speech. The outputs of all models are evaluated using both objective and subjective measures. In particular, we compare specific speaker pairings using a unique corpus collected for dubbing children's speech.
URL
https://arxiv.org/abs/2512.12129