Abstract
End-to-end speech recognition models benefit from incorporating external text sources, typically via fusion with an external language model. Such language models must be retrained whenever the corpus of interest changes. Furthermore, because they store the entire corpus in their parameters, rare words can be difficult to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval language model, which directly retrieves from an external text corpus plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent predictions by an adapter, which is trained once, so that the corpus of interest can be switched without incurring the computational overhead of retraining. Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets. Further, it outperforms shallow fusion on recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.
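To make the core idea concrete, here is a minimal sketch of retrieving plausible completions for a partial hypothesis from an external text corpus. This is an illustrative toy (simple n-gram prefix matching over raw sentences), not the paper's actual retrieval mechanism; the corpus contents, function name, and parameters are all hypothetical.

```python
from collections import Counter

def retrieve_completions(corpus, partial_hypothesis, max_candidates=5):
    """Return plausible next-word completions of a partial hypothesis,
    found by scanning an external text corpus for matching prefixes.
    (Toy illustration; the paper's retriever is not specified here.)"""
    prefix = partial_hypothesis.lower().split()
    candidates = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        # Slide over the sentence looking for the partial hypothesis.
        for i in range(len(words) - len(prefix)):
            if words[i:i + len(prefix)] == prefix:
                # The word right after the match is a candidate completion.
                candidates[words[i + len(prefix)]] += 1
    return [w for w, _ in candidates.most_common(max_candidates)]

# Hypothetical external corpus; swapping it requires no retraining.
corpus = [
    "the battle of hastings took place in 1066",
    "the battle of hastings was fought in england",
    "the battle of trafalgar was a naval engagement",
]
print(retrieve_completions(corpus, "the battle of"))
# → ['hastings', 'trafalgar']
```

Because retrieval happens at inference time over plain text, the corpus can be swapped freely; in the proposed model, the retrieved completions would then be consumed by the trained adapter rather than printed.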
URL
https://arxiv.org/abs/2303.10942