Abstract
Reasoning large language models exhibit complex reasoning behaviors through extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods, designed for retrieval tasks, mistakenly compress reasoning-critical heads, causing significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models: some heads are critical for chain-of-thought consistency, while others are compressible. To validate and exploit this insight, we propose RLKV, a novel framework for identifying reasoning-critical heads, which uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. Because RLKV computes rewards from samples actually generated during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying a compressed, constant-size KV cache to the others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near-lossless performance relative to uncompressed results.
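The abstract does not detail the cache mechanics, so the sketch below only illustrates the allocation scheme it describes: heads identified as reasoning-critical keep their full key/value history, while every other head is held to a constant-size cache during decoding. This is a minimal PyTorch-style illustration, not the paper's implementation; `HybridKVCache`, `critical_heads`, and `window` are hypothetical names, and a simple recency window stands in for whatever constant-size compression the paper actually uses.

```python
import torch

class HybridKVCache:
    """Per-head KV cache sketch: full cache for critical heads, constant window otherwise."""

    def __init__(self, num_heads: int, critical_heads: set[int], window: int = 64):
        self.critical = set(critical_heads)  # heads kept at full cache (assumed given by RLKV)
        self.window = window                 # constant budget for all other heads
        self.keys = [None] * num_heads
        self.values = [None] * num_heads

    def update(self, head: int, k: torch.Tensor, v: torch.Tensor):
        # k, v: [1, head_dim] key/value tensors for one decoding step of one head.
        if self.keys[head] is None:
            self.keys[head], self.values[head] = k, v
        else:
            self.keys[head] = torch.cat([self.keys[head], k], dim=0)
            self.values[head] = torch.cat([self.values[head], v], dim=0)
        # Non-critical heads keep only the most recent `window` entries, so their
        # memory stays constant no matter how long the chain of thought runs.
        if head not in self.critical:
            self.keys[head] = self.keys[head][-self.window:]
            self.values[head] = self.values[head][-self.window:]
        return self.keys[head], self.values[head]
```

With, say, 4 of 32 heads marked critical and a 64-token window, total cache growth is dominated by the 4 full heads, which is how a 20-50% reduction can coexist with near-lossless reasoning, provided the critical heads are identified well.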
URL
https://arxiv.org/abs/2510.08525