
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

2025-10-09 17:50:00
Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang

Abstract

Reasoning large language models exhibit complex reasoning behaviors through extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods, designed for retrieval tasks, mistakenly compress reasoning-critical heads, causing significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models: some heads are critical for chain-of-thought consistency, while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework that uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. Because RLKV derives rewards from samples actually generated during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate a full KV cache to these heads while applying a compressed, constant-size KV cache to the others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near-lossless performance compared to uncompressed results.
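To make the cache-allocation idea concrete, here is a minimal Python sketch of a hybrid per-head KV cache in the spirit the abstract describes: heads flagged as reasoning-critical keep their full decoding history, while all other heads keep only a constant-size window of recent entries. The class name `HybridKVCache`, the particular critical-head indices, and the window size are illustrative assumptions, not the paper's implementation; in RLKV the critical set would come from the RL-guided head identification.

```python
# Sketch of a hybrid per-head KV cache: full history for reasoning-critical
# heads, constant-size window for the rest. Names and numbers are illustrative
# assumptions, not the paper's actual API.
import numpy as np
from collections import deque


class HybridKVCache:
    """Full cache for reasoning-critical heads, constant window for others."""

    def __init__(self, num_heads, head_dim, critical_heads, window=128):
        self.num_heads = num_heads
        critical = set(critical_heads)
        # Unbounded deques for critical heads; bounded deques (which evict
        # their oldest entry automatically) give the constant cache for others.
        self.keys = [deque(maxlen=None if h in critical else window)
                     for h in range(num_heads)]
        self.values = [deque(maxlen=None if h in critical else window)
                       for h in range(num_heads)]

    def append(self, k, v):
        """Append one decoding step; k and v have shape (num_heads, head_dim)."""
        for h in range(self.num_heads):
            self.keys[h].append(k[h])
            self.values[h].append(v[h])

    def num_entries(self):
        return sum(len(ks) for ks in self.keys)


# Toy usage: 4 of 32 heads treated as reasoning-critical, 1000 decode steps.
cache = HybridKVCache(num_heads=32, head_dim=64,
                      critical_heads=[3, 7, 19, 28], window=512)
for _ in range(1000):
    step_k = np.random.randn(32, 64).astype(np.float32)
    step_v = np.random.randn(32, 64).astype(np.float32)
    cache.append(step_k, step_v)

full = 32 * 1000
print(f"cache entries: {cache.num_entries()} vs. full {full} "
      f"({1 - cache.num_entries() / full:.0%} reduction)")
```

With these toy numbers the cache shrinks by roughly 40%, in line with the 20-50% range the abstract reports; the actual savings depend on how few heads turn out to be reasoning-critical and on the window budget given to the compressed heads.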


URL

https://arxiv.org/abs/2510.08525

PDF

https://arxiv.org/pdf/2510.08525.pdf

