Abstract
Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of performing various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Using a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.
URL
https://arxiv.org/abs/2404.10922