Abstract
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding remains severely bottlenecked by limited context length and the exorbitant memory footprint of long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning with a curriculum learning strategy: the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite using significantly less training data than the baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing these long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
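The abstract's core mechanism can be illustrated with a minimal sketch: one summarization token is appended per speech interval, and at inference only the KV entries at those token positions are retained. All names here (`SST_ID`, `insert_sst`, `compress_kv`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of SST-style KV compression; names are illustrative,
# not taken from the Speech-XL paper.

SST_ID = -1  # placeholder id standing in for the Speech Summarization Token


def insert_sst(frames, interval):
    """Append one SST placeholder after every `interval` speech frames."""
    out = []
    for i, frame in enumerate(frames, 1):
        out.append(frame)
        if i % interval == 0:
            out.append(SST_ID)
    if out and out[-1] != SST_ID:
        out.append(SST_ID)  # summarize any trailing partial interval
    return out


def compress_kv(kv_cache, token_ids):
    """Keep only the KV entries at SST positions; drop intra-interval KV."""
    return [kv for kv, tid in zip(kv_cache, token_ids) if tid == SST_ID]


# Usage: 8 speech frames with an interval of 4 leave only 2 KV entries,
# a 5x reduction of the cache in this toy setting.
frames = list(range(8))
ids = insert_sst(frames, interval=4)          # 10 tokens: 8 frames + 2 SSTs
kv = [f"kv{i}" for i in range(len(ids))]      # stand-in for per-token KV pairs
kept = compress_kv(kv, ids)                   # only the SST positions survive
```

The curriculum described in the abstract would correspond, in this sketch, to training with a small `interval` (low compression ratio) first and gradually enlarging it, so the SST learns to summarize ever-longer spans.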
URL
https://arxiv.org/abs/2602.05373