Paper Reading AI Learner

BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

2024-10-30 14:53:37
Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He

Abstract

Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by $\textbf{2.5}\times$ in LLM inference while maintaining over 99% accuracy in long-text summarization, and (2) surpasses state-of-the-art performance in multi-document question answering by $\textbf{7.69\%}$ under the same memory limit, where full-cache methods encounter out-of-memory issues. Additionally, BUZZ achieves significant inference speedup with $O(\log n)$ time complexity. The code is available at this https URL.
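
The cache policy sketched in the abstract can be pictured as: retain the most recent tokens verbatim in a sliding window, split the older history into fixed-size segments, and keep only the highest-scoring ("heavy hitter") tokens inside each segment. Below is a minimal PyTorch sketch of such a segmented heavy-hitter selection. The parameter names, the optional attention-sink tokens, and the use of accumulated attention scores as the importance signal are illustrative assumptions, not the authors' released implementation.

import torch

def buzz_style_keep_indices(
    attn_scores: torch.Tensor,  # (seq_len,) accumulated attention received per token (assumed signal)
    seq_len: int,
    window: int = 128,          # sliding window of recent tokens kept verbatim
    n_sink: int = 4,            # initial "sink" tokens kept (common in sparse-KV methods)
    chunk: int = 64,            # segment size for the historical region
    k_per_chunk: int = 8,       # heavy hitters retained per segment
) -> torch.Tensor:
    """Return sorted indices of KV entries to retain under a BUZZ-like policy (sketch)."""
    keep = set(range(min(n_sink, seq_len)))                # attention sinks at the start
    keep |= set(range(max(0, seq_len - window), seq_len))  # recent sliding window
    # Segment the middle history into chunks; keep each chunk's local heavy hitters.
    start, end = n_sink, max(n_sink, seq_len - window)
    for c0 in range(start, end, chunk):
        c1 = min(c0 + chunk, end)
        k = min(k_per_chunk, c1 - c0)
        top = torch.topk(attn_scores[c0:c1], k).indices + c0
        keep |= set(top.tolist())
    return torch.tensor(sorted(keep), dtype=torch.long)

# Usage sketch: slice the cached keys/values, shaped (batch, heads, seq_len, head_dim).
scores = torch.rand(1024)
idx = buzz_style_keep_indices(scores, seq_len=1024)
# k_cache = k_cache[:, :, idx, :]; v_cache = v_cache[:, :, idx, :]

Because only per-chunk top-k selections and a fixed window survive each step, the retained cache stays small while local neighborhoods of the history remain represented, which is the intuition behind the segmented heavy-hitter design described above.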

URL

https://arxiv.org/abs/2410.23079

PDF

https://arxiv.org/pdf/2410.23079.pdf

