DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models

2024-10-31 17:43:13
Heng-Jui Chang, Hongyu Gong, Changhan Wang, James Glass, Yu-An Chung

Abstract

Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens that are rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach that makes DC-Spin streamable without retraining or performance degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.

URL

https://arxiv.org/abs/2410.24177

PDF

https://arxiv.org/pdf/2410.24177.pdf

