Scaling Speech-Text Pre-training with Synthetic Interleaved Data

2024-11-26 17:19:09
Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang

Abstract

Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction than text-based large language models (LLMs). Traditional approaches to developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, limiting their scalability relative to text-based LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing the corresponding speech spans with a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach yields discrete speech tokens with strong semantic preservation even at lower sampling rates (e.g., 12.5 Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B tokens of synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving accuracy on spoken question answering from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that fine-tuning the pre-trained model on speech dialogue data yields an end-to-end spoken chatbot whose conversational ability and speech quality are competitive with existing baselines, even when operating exclusively in the speech domain.
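The data recipe the abstract describes, sampling text spans and swapping some of them for synthesized speech tokens rather than actual audio, can be illustrated with a short sketch. This is not the paper's implementation: `text_to_token` below is a hypothetical stand-in for its text-to-token model, and the span length, sentinel markers, and 50% speech ratio are illustrative assumptions.

```python
# Minimal sketch of synthetic speech-text interleaving (assumed details marked).
import random

def text_to_token(span: str) -> list[int]:
    """Hypothetical stand-in for the paper's text-to-token model: map a text
    span directly to discrete speech-token IDs, skipping waveform synthesis."""
    return [hash(w) % 4096 for w in span.split()]  # placeholder token IDs

def make_interleaved_sequence(document: str, speech_ratio: float = 0.5,
                              span_words: int = 16) -> list:
    """Split a document into fixed-size word spans and stochastically replace
    a fraction of them with synthetic speech tokens."""
    words = document.split()
    sequence = []
    for i in range(0, len(words), span_words):
        span = " ".join(words[i:i + span_words])
        if random.random() < speech_ratio:
            # Speech span: discrete tokens wrapped in assumed sentinel markers.
            sequence += ["<speech>", *text_to_token(span), "</speech>"]
        else:
            # Text span kept verbatim.
            sequence.append(span)
    return sequence

print(make_interleaved_sequence("the quick brown fox jumps over the lazy dog " * 10))
```

Because the speech side is produced as tokens rather than waveforms, the corpus can be scaled to the hundreds of billions of tokens the abstract reports without running a full TTS pipeline.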
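The tokenizer idea, inserting a vector-quantized bottleneck into an ASR encoder so its latent states become discrete speech tokens supervised by transcription targets, can likewise be sketched. The PyTorch snippet below assumes a nearest-neighbor codebook with a straight-through estimator; the codebook size, dimensions, toy GRU encoder, and the omitted downsampling toward ~12.5 Hz are assumptions, not the paper's configuration.

```python
# Minimal PyTorch sketch of a VQ bottleneck inside an ASR-style encoder.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, codebook_size: int = 4096, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)  # assumed sizes

    def forward(self, h: torch.Tensor):
        # h: (batch, frames, dim) encoder states; a real tokenizer would first
        # downsample frames toward the ~12.5 Hz rate cited in the abstract.
        flat = h.reshape(-1, h.size(-1))                   # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)    # (B*T, K)
        ids = dists.argmin(dim=-1).view(h.shape[:-1])      # (B, T) speech tokens
        q = self.codebook(ids)
        # Straight-through estimator: gradients reach the encoder as if the
        # quantization step were the identity.
        q = h + (q - h).detach()
        return q, ids

# Toy usage: a GRU stands in for the ASR encoder; in supervised training the
# quantized branch would feed an ASR decoder with transcript targets.
encoder = nn.GRU(input_size=80, hidden_size=512, batch_first=True)
vq = VQBottleneck()
mels = torch.randn(2, 100, 80)        # fake log-mel features
h, _ = encoder(mels)
q, speech_tokens = vq(h)
print(speech_tokens.shape)            # torch.Size([2, 100])
```

Supervising the bottleneck with ASR targets is what pushes the discrete tokens toward semantic content rather than acoustic detail, which is why they stay informative at low token rates.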

URL

https://arxiv.org/abs/2411.17607

PDF

https://arxiv.org/pdf/2411.17607.pdf

