Paper Reading AI Learner

SnapKV: LLM Knows What You are Looking for Before Generation

2024-04-22 17:42:58
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen

Abstract

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Moreover, this robust pattern can be obtained from an "observation" window located at the end of the prompt. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains performance comparable to baseline models across 16 long-sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using the HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
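The core idea described in the abstract can be sketched in a few lines: score each prompt position by the attention it receives from an "observation" window of queries at the end of the prompt, smooth the scores with 1-D pooling so that clustered positions are kept together, then retain only the top-scoring KV positions plus the observation window itself. The sketch below is a minimal, hypothetical illustration of that selection step, not the paper's actual implementation; the function name, shapes, and hyperparameters (`window`, `capacity`, `kernel`) are assumptions for demonstration.

```python
import torch
import torch.nn.functional as F


def snapkv_compress(keys, values, queries, window=32, capacity=1024, kernel=7):
    """Illustrative SnapKV-style per-head KV cache compression (a sketch).

    keys, values, queries: tensors of shape [heads, seq_len, dim].
    Returns compressed keys/values of shape [heads, capacity, dim]
    (or the inputs unchanged if the prompt already fits).
    """
    heads, seq_len, dim = keys.shape
    if seq_len <= capacity:
        return keys, values

    # Attention from the observation window (last `window` queries) to all keys.
    obs_q = queries[:, -window:, :]
    attn = torch.softmax(obs_q @ keys.transpose(1, 2) / dim**0.5, dim=-1)

    # Vote for each prefix position: total attention it receives from the window.
    votes = attn[:, :, : seq_len - window].sum(dim=1)  # [heads, prefix_len]

    # Pool votes so neighboring important positions are kept as clusters.
    votes = F.avg_pool1d(
        votes.unsqueeze(1), kernel, stride=1, padding=kernel // 2
    ).squeeze(1)

    # Keep the top positions per head, plus the observation window itself.
    keep = votes.topk(capacity - window, dim=-1).indices.sort(dim=-1).values
    obs_idx = torch.arange(seq_len - window, seq_len).expand(heads, -1)
    idx = torch.cat([keep, obs_idx], dim=-1)  # [heads, capacity]

    gather = idx.unsqueeze(-1).expand(-1, -1, dim)
    return keys.gather(1, gather), values.gather(1, gather)
```

In a real decoder this selection would run once at the end of prefill, after which generation proceeds against the compressed cache; per-head selection matters because, per the abstract, each attention head attends to its own set of important prompt positions.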

Abstract (translated)

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a key role in enhancing their performance. However, as input length increases, the growth of the KV cache poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative, fine-tuning-free approach that efficiently reduces KV cache size while delivering performance comparable to the baseline in real-world applications. We find that each attention head in the model consistently focuses on specific prompt attention features during generation, and that this robust pattern can be obtained from an observation window at the end of the prompt. Based on this insight, SnapKV automatically compresses the KV cache by selecting clustered important KV positions for each attention head. Our approach significantly reduces computational overhead and memory footprint when processing long input sequences. Specifically, when processing 16K-token inputs, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency compared to the baseline. At the same time, it performs comparably to baseline models on 16 long-sequence datasets. Furthermore, using the HuggingFace implementation, SnapKV can process up to 380K context tokens on a single A100-80GB GPU with only a negligible accuracy drop in the Needle-in-a-Haystack test. Further studies suggest that SnapKV has great potential for practical applications.

URL

https://arxiv.org/abs/2404.14469

PDF

https://arxiv.org/pdf/2404.14469.pdf

