Iteratively Improving Speech Recognition and Voice Conversion

Abstract

Many existing works on voice conversion (VC) use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, for low-data-resource domains, training a high-quality ASR model remains a challenging task. In this work, we propose a novel iterative way of improving both the ASR and VC models. We first train an ASR model that is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvements in both models. Our proposed framework outperforms the ASR and one-shot VC baseline models on English singing and Hindi speech domains in subjective and objective evaluations in low-data-resource settings.
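The alternating procedure described in the abstract can be sketched as a simple loop. This is only an illustrative outline, not the authors' implementation: the function name `iterate_asr_vc`, the callables `train_vc`, `augment`, and `finetune_asr`, and the `rounds` parameter are all hypothetical, and fine-tuning on the original plus converted samples is an assumption, since the abstract does not specify the data mix.

```python
from typing import Callable, List, Tuple, TypeVar

ASR = TypeVar("ASR")        # placeholder type for an ASR model
VC = TypeVar("VC")          # placeholder type for a VC model
Sample = TypeVar("Sample")  # placeholder type for a training sample


def iterate_asr_vc(
    asr: ASR,
    data: List[Sample],
    train_vc: Callable[[ASR, List[Sample]], VC],
    augment: Callable[[VC, List[Sample]], List[Sample]],
    finetune_asr: Callable[[ASR, List[Sample]], ASR],
    rounds: int = 2,
) -> Tuple[ASR, VC]:
    """Alternate VC training and ASR fine-tuning for `rounds` iterations.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    vc = None
    for _ in range(rounds):
        # The current ASR model supplies the content-preservation signal
        # while the VC model is trained.
        vc = train_vc(asr, data)
        # The VC model re-voices the data to simulate more diverse speakers.
        converted = augment(vc, data)
        # Assumption: fine-tune the ASR model on original plus converted data.
        asr = finetune_asr(asr, data + converted)
    return asr, vc
```

With toy stand-ins (an integer version counter for the ASR model and a string label for the VC model), calling `iterate_asr_vc(0, ["utt1", "utt2"], lambda a, d: f"vc-from-asr-v{a}", lambda v, d: [s + "-converted" for s in d], lambda a, d: a + 1, rounds=3)` shows the dependency order: each VC model is trained with the ASR model from the previous round, and each ASR update sees data produced by the newest VC model.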

Abstract (translated)

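Many existing voice conversion (VC) approaches rely on automatic speech recognition (ASR) models to keep the linguistic content of the source and converted samples consistent. In low-data-resource domains, however, training a high-quality ASR model is still difficult. This study proposes a new iterative method for improving both the ASR and VC models. An ASR model is trained first and used to ensure content preservation during VC training; in the next iteration, the VC model serves as a data augmentation method to further fine-tune the ASR model and make it applicable to a wider range of speakers. By repeatedly using the improved ASR model to train the VC model and vice versa, the experiments show gains in both models. In low-data-resource settings, the proposed framework outperforms the ASR and one-shot VC baseline models on English singing and Hindi speech domains in both subjective and objective evaluations.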

URL

https://arxiv.org/abs/2305.15055

PDF

https://arxiv.org/pdf/2305.15055.pdf