Abstract
Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
Abstract (translated)
Speech recognition and translation systems perform poorly on noisy inputs, which are common in real-world environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is available only in limited quantities and for fewer languages than audio-only resources. To fill this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefit of limited multilingual AV pre-training data by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark demonstrates the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU on noisy AV inputs. In addition, XLAVS-R enables strong zero-shot audio-visual ability with audio-only fine-tuning.
URL
https://arxiv.org/abs/2403.14402