Abstract
This paper introduces Easy One-Step Text-to-Speech (E1 TTS), an efficient non-autoregressive zero-shot text-to-speech system based on denoising diffusion pretraining and distribution matching distillation. The training of E1 TTS is straightforward; it does not require explicit monotonic alignment between the text and audio pairs. The inference of E1 TTS is efficient, requiring only one neural network evaluation for each utterance. Despite its sampling efficiency, E1 TTS achieves naturalness and speaker similarity comparable to various strong baseline models. Audio samples are available at this http URL.
URL
https://arxiv.org/abs/2409.09351