Abstract
Many existing works on voice conversion (VC) use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, in low-resource domains, training a high-quality ASR model remains a challenging task. In this work, we propose a novel iterative method for improving both the ASR and VC models. We first train an ASR model, which is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used for data augmentation to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvements in both models. Our proposed framework outperforms the ASR and one-shot VC baseline models on English singing and Hindi speech domains in subjective and objective evaluations in low-resource settings.
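The alternating training loop described above can be sketched structurally as follows. This is a minimal, hypothetical illustration of the iteration order only: `train_asr`, `train_vc`, and `augment_with_vc` are stand-in stubs (not the paper's actual models or code), and the key idea shown is that VC-converted utterances reuse the original transcripts, so they can serve as labeled ASR training data.

```python
# Hypothetical sketch of the iterative ASR <-> VC training loop.
# All three functions are stand-ins, not the paper's implementation.

def train_asr(corpus):
    """Stand-in: 'train' an ASR model on (audio, transcript) pairs."""
    return {"train_size": len(corpus)}

def train_vc(asr_model, corpus):
    """Stand-in: 'train' a VC model, using the ASR model as a
    content-preservation (linguistic-consistency) constraint."""
    return {"asr_train_size": asr_model["train_size"]}

def augment_with_vc(vc_model, corpus, n_new_speakers=2):
    """Stand-in: convert each utterance to new target speakers while
    keeping the transcript, yielding extra labeled ASR data."""
    augmented = []
    for audio, text in corpus:
        for spk in range(n_new_speakers):
            augmented.append((f"{audio}->spk{spk}", text))
    return corpus + augmented

corpus = [("utt1.wav", "hello"), ("utt2.wav", "world")]
for iteration in range(2):
    asr = train_asr(corpus)               # step 1: (re)train / fine-tune ASR
    vc = train_vc(asr, corpus)            # step 2: train VC with ASR content loss
    corpus = augment_with_vc(vc, corpus)  # step 3: VC augmentation for next round
```

Each pass enlarges the speaker-diverse labeled corpus, which is what lets the next ASR fine-tuning round generalize to more speakers.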
URL
https://arxiv.org/abs/2305.15055