Paper Reading AI Learner

Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training

2024-04-16 21:45:59
Pavel Denisov, Ngoc Thang Vu

Abstract

Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.

URL

https://arxiv.org/abs/2404.10922

PDF

https://arxiv.org/pdf/2404.10922.pdf

