Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

2020-11-09 21:34:38

Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-Feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig

arXiv_SD Speech_Recognition RNN Recognition Inference Knowledge Transformer Speech

Abstract

In this work, to measure the accuracy and efficiency of a latency-controlled streaming automatic speech recognition (ASR) application, we comprehensively evaluate three popular training criteria: LF-MMI, CTC and RNN-T. Transcribing social media videos in 7 languages, with 3K-14K hours of training data, we conduct large-scale controlled experiments across the criteria using identical datasets and an identical encoder model architecture. We find that RNN-T consistently wins on ASR accuracy, while CTC models excel at inference efficiency. Moreover, we selectively examine various modeling strategies for the different training criteria, including modeling units, encoder architectures and pre-training. To the best of our knowledge, this is the first comprehensive benchmark of these three widely used training criteria on a large-scale, real-world streaming ASR application spanning many languages.

URL

https://arxiv.org/abs/2011.04785

PDF

https://arxiv.org/pdf/2011.04785.pdf