Abstract
Discrete speech tokens have become increasingly popular across speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS), and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge. Notably, we ranked first on the TTS track leaderboard both with the whole training set and with only 1 hour of training data, while also achieving the lowest bitrate among all submissions.
URL
https://arxiv.org/abs/2404.06079