Abstract
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data and a 90% reduction on ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
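To make the encoder/decoder split above concrete, the sketch below shows the standard greedy RNN-T decoding loop that a Conformer-encoder + RNN-T-decoder system of this kind runs at inference time: for each encoder frame, the joiner combines the frame with the prediction network's output and emits labels until it produces blank. All function and variable names here are illustrative stand-ins (toy components, not the paper's implementation), chosen only to show the control flow.

```python
# Illustrative sketch of greedy RNN-T decoding (not AssemblyAI's code).
# The prediction network and joiner are toy stand-ins so the loop runs end to end.

BLANK = 0  # the RNN-T blank symbol


def greedy_rnnt_decode(enc_frames, predict, join, max_symbols_per_frame=4):
    """Greedy RNN-T search: at each encoder frame, emit non-blank labels
    until the joiner outputs blank, then advance to the next frame."""
    hyp = []      # emitted label sequence so far
    state = None  # prediction-network state (conditioned on labels only)
    for frame in enc_frames:
        # Cap emissions per frame to guard against degenerate loops.
        for _ in range(max_symbols_per_frame):
            prev = hyp[-1] if hyp else BLANK
            pred_out, state = predict(prev, state)
            label = join(frame, pred_out)  # argmax over joint logits
            if label == BLANK:
                break  # nothing more to emit here; move to the next frame
            hyp.append(label)
    return hyp


# Toy stand-ins: the prediction network echoes the last emitted label, and
# the joiner emits the frame's value unless it matches that label (so
# consecutive repeated frames collapse to a single emission).
def toy_predict(prev_label, state):
    return prev_label, state


def toy_join(frame, pred_out):
    return frame if pred_out != frame else BLANK


print(greedy_rnnt_decode([5, 7, 7, 3], toy_predict, toy_join))  # → [5, 7, 3]
```

Because the prediction network depends only on the label history, not on audio frames, this factorization is what enables the streaming-friendly, non-autoregressive-over-text speedups the abstract attributes to the RNN-T choice relative to Whisper's full attention-based decoder.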
URL
https://arxiv.org/abs/2404.09841