Abstract
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve state-of-the-art (SOTA) performance on the AISHELL-1, TestNet, and TestMeeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pretrained models and training logs, to promote reproducible research.
URL
https://arxiv.org/abs/2405.02132