TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data

2025-06-18 16:44:28
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

Abstract

This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data such as online videos, often referred to as "dark data." Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed loop by dynamically adapting both data selection and data cleansing to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
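
The abstract does not give implementation details, so the following is only a minimal sketch of how components (2) and (3) and the closed loop might fit together; it is not the authors' algorithm. Every name here (Utterance, select_corpus, closed_loop, and the train, make_scorer, and cleanser callables) is hypothetical, and the scorer simply stands in for "predicted MOS impact of training on this utterance."

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Utterance:
    utt_id: str
    audio: Tuple[float, ...]  # placeholder for waveform samples
    text: str


# Hypothetical interfaces (assumptions, not taken from the paper):
Cleanser = Callable[[Utterance], Utterance]   # e.g. denoising, or identity
Scorer = Callable[[Utterance], float]         # stand-in for predicted MOS impact


def select_corpus(
    pool: List[Utterance],
    cleansers: Dict[str, Cleanser],
    score: Scorer,
    threshold: float,
) -> List[Tuple[str, Utterance, float]]:
    """Pick the best-scoring cleansing method per utterance, then keep
    only utterances whose best score reaches the threshold."""
    selected = []
    for utt in pool:
        best_name, best_utt, best_score = "", utt, float("-inf")
        for name, cleanse in cleansers.items():
            candidate = cleanse(utt)
            s = score(candidate)
            if s > best_score:  # on ties, the first method wins
                best_name, best_utt, best_score = name, candidate, s
        if best_score >= threshold:
            selected.append((best_name, best_utt, best_score))
    return selected


def closed_loop(
    pool: List[Utterance],
    cleansers: Dict[str, Cleanser],
    train: Callable[[List[Utterance]], Any],
    make_scorer: Callable[[Any], Scorer],
    threshold: float,
    rounds: int = 2,
) -> List[Utterance]:
    """Alternate between training on the current corpus and re-selecting
    the corpus with a scorer derived from the freshly trained model."""
    corpus = list(pool)  # first round: train on the raw pool
    for _ in range(rounds):
        model = train(corpus)
        scorer = make_scorer(model)
        picked = select_corpus(pool, cleansers, scorer, threshold)
        corpus = [utt for _, utt, _ in picked]
    return corpus
```

In this sketch the scorer is rebuilt from the model each round, which is one simple way to make selection and cleansing depend on the target TTS model rather than on acoustic quality alone.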

URL

https://arxiv.org/abs/2506.15614

PDF

https://arxiv.org/pdf/2506.15614.pdf

