Iteratively Improving Speech Recognition and Voice Conversion

2023-05-24 11:45:42
Mayank Kumar Singh, Naoya Takahashi, Onoe Naoyuki

Abstract

Many existing works on voice conversion (VC) use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, in low-resource domains, training a high-quality ASR model remains challenging. In this work, we propose a novel iterative way of improving both the ASR and VC models. We first train an ASR model, which is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvement in both models. Our proposed framework outperforms the ASR and one-shot VC baseline models on English singing and Hindi speech domains in subjective and objective evaluations in low-resource settings.
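
The alternating schedule described in the abstract can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: the function name `iterate_asr_vc`, the three callables, and the toy stand-ins are all hypothetical, and fine-tuning the ASR model on the original data plus its VC-converted copies is an assumption about how the augmentation is applied.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical data type: (audio identifier, transcript).
Utterance = Tuple[str, str]


def iterate_asr_vc(
    train_asr: Callable[[Optional[Any], List[Utterance]], Any],
    train_vc: Callable[[Optional[Any], Any, List[Utterance]], Any],
    convert: Callable[[Any, List[Utterance]], List[Utterance]],
    data: List[Utterance],
    num_rounds: int,
) -> Tuple[Any, Any, List[str]]:
    """Alternate between VC training and ASR fine-tuning for num_rounds."""
    log: List[str] = []

    # Step 0: train an initial ASR model on the original data only.
    asr = train_asr(None, data)
    log.append("asr:init")

    vc = None
    for r in range(1, num_rounds + 1):
        # The current ASR model supplies the content-preservation signal
        # while the VC model is trained.
        vc = train_vc(vc, asr, data)
        log.append(f"vc:round{r}")

        # The VC model acts as data augmentation: converted utterances keep
        # their transcripts (assumption: original and converted are pooled).
        augmented = list(data) + convert(vc, data)

        # Fine-tune the ASR model on the augmented set.
        asr = train_asr(asr, augmented)
        log.append(f"asr:round{r}")

    return asr, vc, log


# Toy stand-ins so the sketch runs; real models would replace these.
def toy_train_asr(prev: Optional[dict], data: List[Utterance]) -> dict:
    version = 0 if prev is None else prev["version"] + 1
    return {"version": version, "n_train": len(data)}


def toy_train_vc(prev: Optional[dict], asr: dict, data: List[Utterance]) -> dict:
    version = 0 if prev is None else prev["version"] + 1
    return {"version": version, "asr_version": asr["version"]}


def toy_convert(vc: dict, data: List[Utterance]) -> List[Utterance]:
    return [(f"{audio}|vc{vc['version']}", text) for audio, text in data]
```

Passing the training steps in as callables keeps the loop independent of any particular ASR or VC architecture, which the abstract does not specify.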

URL

https://arxiv.org/abs/2305.15055

PDF

https://arxiv.org/pdf/2305.15055.pdf

