Abstract
Large self-supervised pre-trained speech models have achieved remarkable success across various speech-processing tasks. The self-supervised training of these models yields universal speech representations that can be used for different downstream tasks, ranging from automatic speech recognition (ASR) to speaker identification. Recently, Whisper, a transformer-based model, was proposed and trained on a large amount of weakly supervised data for ASR; it outperformed several state-of-the-art self-supervised models. Given the superiority of Whisper for ASR, in this paper we explore the transferability of its representations to four other speech tasks in the SUPERB benchmark. Moreover, we explore the robustness of Whisper representations for ``in the wild'' tasks, where speech is corrupted by environmental noise and room reverberation. Experimental results show that Whisper achieves promising results across tasks and environmental conditions, thus showing its potential for cross-task real-world deployment.
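As a minimal sketch of the underlying idea (not the paper's exact pipeline), the snippet below shows how Whisper's encoder hidden states can be extracted as frame-level speech representations for downstream tasks, using the Hugging Face `transformers` implementation of Whisper. The model size (`openai/whisper-base`), the use of the final encoder layer, and the placeholder audio are assumptions for illustration.

```python
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Assumed model size for illustration; the paper may use a different variant.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

# Placeholder: one second of 16 kHz mono audio. In practice this would be
# a task-specific utterance (e.g., from a SUPERB downstream dataset).
waveform = np.random.randn(16000).astype(np.float32)

# Convert raw audio to log-Mel input features, then run the encoder only;
# the decoder is not needed when Whisper serves as a feature extractor.
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model.encoder(inputs.input_features).last_hidden_state

# Frame-level representations: (batch, frames, hidden_dim),
# e.g. (1, 1500, 512) for whisper-base. A lightweight task-specific head
# would typically be trained on top of these features.
print(hidden.shape)
```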
URL
https://arxiv.org/abs/2305.14546