Paper Reading AI Learner

Efficient infusion of self-supervised representations in Automatic Speech Recognition

2024-04-19 05:01:12
Darshan Prabhu, Sai Ganesh Mirishkar, Pankaj Wasnik

Abstract

Self-supervised learning (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given their effectiveness, it is advantageous to use such models in conventional ASR systems. While some approaches incorporate these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and computationally expensive. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate representations from the SSL model(s) into the ASR architecture, yielding models comparable in size to standard encoder-decoder conformer systems while also avoiding the use of SSL models during training. Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets compared to baselines. We further provide detailed analyses and ablation studies that demonstrate the effectiveness of our approach.
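The two fusion mechanisms named in the abstract can be sketched in PyTorch as below. This is a minimal illustration, not the authors' implementation: the module names, projection layers, and the residual/normalization placement in the cross-attention variant are assumptions. Both modules take encoder frames and precomputed SSL frames of matching length, which is what lets training avoid running the SSL model itself.

```python
import torch
import torch.nn as nn


class FramewiseAdditionFusion(nn.Module):
    """Fuse precomputed SSL features into ASR encoder features by
    projecting them to the encoder dimension and adding frame by frame."""

    def __init__(self, enc_dim: int, ssl_dim: int):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, enc_dim)

    def forward(self, enc_feats: torch.Tensor, ssl_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, frames, enc_dim)
        # ssl_feats: (batch, frames, ssl_dim), computed offline, aligned framewise
        return enc_feats + self.proj(ssl_feats)


class CrossAttentionFusion(nn.Module):
    """Fuse precomputed SSL features via multi-head cross-attention:
    encoder frames act as queries, SSL frames as keys/values."""

    def __init__(self, enc_dim: int, ssl_dim: int, num_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, enc_dim)
        self.attn = nn.MultiheadAttention(enc_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(enc_dim)

    def forward(self, enc_feats: torch.Tensor, ssl_feats: torch.Tensor) -> torch.Tensor:
        kv = self.proj(ssl_feats)
        attended, _ = self.attn(enc_feats, kv, kv)  # queries = encoder frames
        # Residual connection plus layer norm (placement is an assumption)
        return self.norm(enc_feats + attended)
```

Since the SSL representations are extracted once in advance, neither module adds the SSL model's parameters to the trained system, which is the source of the size and training-speed advantage the abstract claims.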


URL

https://arxiv.org/abs/2404.12628

PDF

https://arxiv.org/pdf/2404.12628.pdf

