Abstract
In short videos and live broadcasts, speech, singing voice, and background music often overlap and obscure one another. This complexity makes it difficult to structure and recognize the audio content, which may impair downstream ASR and music understanding applications. This paper proposes JRSV, an ASR model based on multi-task audio source separation (MTASS) that Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music, and a CTC/attention hybrid recognition module then recognizes both tracks. Online distillation is proposed to further improve the robustness of recognition. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.
URL
https://arxiv.org/abs/2404.11275