Abstract
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction than text-based large language models (LLMs). Traditional approaches to developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability as LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing the corresponding speech spans with a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach yields discrete speech tokens with strong semantic preservation even at lower sampling rates (e.g., 12.5 Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (including 600B tokens of synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving accuracy on spoken question answering from the previous state of the art of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model on speech dialogue data, we can develop an end-to-end spoken chatbot that achieves performance competitive with existing baselines in both conversational ability and speech quality, even when operating exclusively in the speech domain.
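To make the interleaved-data construction concrete, below is a minimal Python sketch, not the authors' implementation: `text_to_speech_tokens` stands in for the text-to-token model, and the 5-20 word span lengths, 4096-entry speech-token vocabulary, and interleaving ratio are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch of synthetic speech-text interleaved data construction:
# sample spans of a text document and replace some of them with discrete
# speech tokens produced directly by a text-to-token model, so no actual
# waveform ever needs to be synthesized.
import random

def text_to_speech_tokens(text: str) -> list[int]:
    # Stand-in for the text-to-token model; a real model would emit
    # discrete speech tokens at a fixed rate (e.g., ~12.5 tokens per second).
    return [hash(word) % 4096 for word in text.split()]

def make_interleaved_example(document: str, speech_ratio: float = 0.3) -> list[str]:
    """Turn a plain-text document into a speech-text interleaved sequence."""
    words = document.split()
    sequence, i = [], 0
    while i < len(words):
        span_len = random.randint(5, 20)
        span = words[i:i + span_len]
        if random.random() < speech_ratio:
            # Replace this text span with synthesized discrete speech tokens.
            sequence += [f"<speech_{t}>" for t in text_to_speech_tokens(" ".join(span))]
        else:
            sequence += span
        i += span_len
    return sequence

example = make_interleaved_example(
    "speech language models accept speech input and produce speech output " * 5
)
print(example[:40])
```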
URL
https://arxiv.org/abs/2411.17607