Paper Reading AI Learner

Efficient Interleaved Speech Modeling through Knowledge Distillation

2025-06-30 09:47:37
Mohammadmahdi Nouriborji, Morteza Rohanian

Abstract

Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
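The abstract's layer-aligned distillation combines three matching terms: a KL divergence on temperature-softened logits, plus mean-squared error on aligned hidden states and attention maps. A minimal numpy sketch of such a combined objective is below; the weighting scheme, temperature, and loss form are illustrative assumptions, not the paper's actual training configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      student_attn, teacher_attn,
                      temperature=2.0, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of three distillation terms (weights are hypothetical).

    Assumes student and teacher tensors have already been aligned to the
    same shapes, e.g. via a layer mapping and projection of hidden states.
    """
    t = temperature
    # 1. Softened-logit matching: KL(teacher || student) at temperature t,
    #    scaled by t^2 as in classic knowledge distillation.
    p_t = softmax(teacher_logits / t)
    p_s = softmax(student_logits / t)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)),
                axis=-1).mean() * t * t
    # 2. Hidden-state matching between aligned layers (MSE).
    hid = np.mean((student_hidden - teacher_hidden) ** 2)
    # 3. Attention-map matching between aligned heads (MSE).
    attn = np.mean((student_attn - teacher_attn) ** 2)
    return alpha * kl + beta * hid + gamma * attn
```

By construction the loss is zero when the student exactly reproduces the teacher's logits, hidden states, and attention maps, and grows as any of the three diverge.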

URL

https://arxiv.org/abs/2506.23670

PDF

https://arxiv.org/pdf/2506.23670.pdf

