Abstract
Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process both text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens that are rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach that makes DC-Spin streamable without retraining or performance degradation. Comparisons of tokenization methods (self-supervised models and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or well aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.
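To illustrate the chunk-wise tokenization idea described in the abstract, below is a minimal sketch (not the authors' code): a frozen feature extractor processes audio in fixed-size chunks, and each frame is assigned to its nearest codebook entry so tokens can be emitted incrementally. The encoder, codebook, chunk size, and dimensions are hypothetical placeholders, not DC-Spin's actual components.

```python
# Hedged sketch of chunk-wise (streamable) speech tokenization.
# All components below are placeholders, not DC-Spin's real encoder/codebook.
import numpy as np

FRAME_DIM = 256        # hypothetical feature dimension
CODEBOOK_SIZE = 500    # hypothetical number of discrete units
CHUNK_FRAMES = 100     # hypothetical chunk length in frames

rng = np.random.default_rng(0)
codebook = rng.standard_normal((CODEBOOK_SIZE, FRAME_DIM))  # stand-in for a learned codebook


def encode_chunk(chunk_frames: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen speech encoder mapping frames to features."""
    return chunk_frames  # identity placeholder; a real encoder is a neural network


def tokenize_stream(frames: np.ndarray) -> list[int]:
    """Tokenize audio chunk by chunk so tokens can be produced as audio arrives."""
    tokens: list[int] = []
    for start in range(0, len(frames), CHUNK_FRAMES):
        feats = encode_chunk(frames[start:start + CHUNK_FRAMES])
        # Nearest-codebook assignment per frame (squared Euclidean distance).
        dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        tokens.extend(dists.argmin(axis=1).tolist())
    return tokens


if __name__ == "__main__":
    dummy_frames = rng.standard_normal((350, FRAME_DIM))  # fake feature frames
    print(len(tokenize_stream(dummy_frames)), "tokens")
```

Because the codebook lookup is applied per chunk with a frozen encoder, streaming falls out of the inference procedure rather than requiring the tokenizer to be retrained, which is the property the abstract highlights.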
URL
https://arxiv.org/abs/2410.24177