A comparative study of generative models for child voice conversion

2025-12-13 01:49:23
Protima Nomo Sudro, Anton Ragni, Thomas Hain

Abstract

Generative models are a popular choice for adult-to-adult voice conversion (VC) because they model unlabelled data efficiently. To date, their usefulness for producing children's speech, and in particular for adult-to-child VC, has not been investigated. For adult-to-child VC, four generative models are compared: a diffusion model, a flow-based model, a variational autoencoder, and a generative adversarial network. Results show that although the converted speech produced by these models appears plausible, it exhibits insufficient similarity to the target speaker characteristics. We introduce an efficient frequency warping technique that can be applied to the output of the models and that significantly reduces the mismatch between adult and child speech. The outputs of all models are evaluated using both objective and subjective measures. In particular, we compare specific speaker pairings using a unique corpus collected for dubbing of children's speech.

URL

https://arxiv.org/abs/2512.12129

PDF

https://arxiv.org/pdf/2512.12129.pdf

