Paper Reading AI Learner

Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models

2026-02-05 06:50:49
Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin

Abstract

Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding remains severely bottlenecked by limited context length and the exorbitant memory footprint of long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio compression of speech inputs. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), one per speech interval, which encapsulates the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning with a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite using significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on condensing extensive acoustic sequences.
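The core idea described above — collapsing each speech interval into a single summary entry so that downstream attention sees one KV pair per interval instead of every frame — can be sketched as follows. This is a minimal, hypothetical illustration: the function names are assumptions, and mean-pooling stands in for the paper's learned SST, which is a trained special token fine-tuned via instruction tuning rather than a fixed pooling rule.

```python
# Hypothetical sketch of SST-style interval compression (not the paper's
# actual implementation): each fixed-length interval of speech frame vectors
# is reduced to one summary vector, the stand-in for the SST's KV pair.

def compress_intervals(frames, interval_len):
    """Mean-pool each interval of frame vectors into a single summary vector.

    frames: list of equal-length feature vectors (lists of floats).
    interval_len: number of frames per interval; the compression ratio
    is therefore interval_len : 1.
    """
    summaries = []
    for start in range(0, len(frames), interval_len):
        chunk = frames[start:start + interval_len]
        dim = len(chunk[0])
        # One vector per interval replaces interval_len frame entries.
        summaries.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    return summaries

# 8 frames of dimension 2, compressed at a 4:1 ratio -> 2 summary entries.
frames = [[float(i), float(i) * 2] for i in range(8)]
kv_cache = compress_intervals(frames, interval_len=4)
```

Under the paper's curriculum strategy, `interval_len` would start small (low-ratio, easy compression) and grow over training toward high-ratio compression.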


URL

https://arxiv.org/abs/2602.05373

PDF

https://arxiv.org/pdf/2602.05373.pdf

