Abstract
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific attention features of the prompt during generation. Moreover, this robust pattern can be obtained from an 'observation' window located at the end of the prompt. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to baseline models across 16 long-sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using the HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
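The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes per-head attention weights from the observation window (the last tokens of the prompt) over the earlier prefix are available, aggregates them into per-position importance votes, applies 1-D max pooling so that important positions pull in their neighbors (the "clustered" selection), and keeps the top positions. The function name `snapkv_select` and its parameters are hypothetical.

```python
# Hypothetical sketch of SnapKV-style KV position selection (NumPy).
import numpy as np

def snapkv_select(attn_weights, capacity, kernel_size=5):
    """Select prefix KV positions to retain for one attention head.

    attn_weights : (window_len, prefix_len) attention from the observation
                   window tokens to the earlier prompt prefix (assumed given).
    capacity     : number of prefix positions to keep.
    kernel_size  : 1-D pooling width used to keep clustered neighbors.
    Returns sorted indices of prefix positions to retain.
    """
    # Aggregate each prefix position's importance across the window tokens.
    votes = attn_weights.sum(axis=0)                      # (prefix_len,)

    # 1-D max pooling: a highly attended position also promotes its
    # neighbors, so selected positions form clusters rather than isolated
    # tokens.
    pad = kernel_size // 2
    padded = np.pad(votes, pad, mode="edge")
    pooled = np.array([padded[i:i + kernel_size].max()
                       for i in range(len(votes))])

    # Keep the top-`capacity` pooled positions, restored to prompt order.
    keep = np.sort(np.argpartition(pooled, -capacity)[-capacity:])
    return keep
```

In an actual KV cache, the retained indices (together with the observation window's own KV entries) would be used to gather the compressed key and value tensors for each head before decoding begins.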
URL
https://arxiv.org/abs/2404.14469