Abstract
Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by $\textbf{2.5}\times$ in LLM inference while maintaining over 99% accuracy in long-text summarization, and (2) surpasses state-of-the-art performance in multi-document question answering by $\textbf{7.69\%}$ under the same memory limit, where full-cache methods encounter out-of-memory issues. Additionally, BUZZ achieves significant inference speedup with $O(\log n)$ time complexity. The code is available at this https URL.
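To make the cache policy concrete, the sketch below illustrates a BUZZ-style eviction rule as described in the abstract: keep a sliding window of recent tokens and, for older tokens, retain only the most important token within each fixed-size chunk (a local neighborhood). This is a minimal sketch, not the authors' implementation; the function name `buzz_style_keep_indices`, the use of accumulated attention weights as the importance score, and the `window`/`chunk` defaults are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of a BUZZ-style KV cache
# eviction rule: a sliding window keeps recent tokens verbatim, while older
# tokens are segmented into fixed-size chunks and only the locally most
# important token per chunk is retained.
import numpy as np

def buzz_style_keep_indices(attn_scores: np.ndarray,
                            window: int = 64,
                            chunk: int = 8) -> np.ndarray:
    """Return indices of tokens to keep in the KV cache.

    attn_scores: per-token importance (e.g. accumulated attention weights),
                 shape (seq_len,). The scoring choice is an assumption here.
    """
    seq_len = attn_scores.shape[0]
    if seq_len <= window:
        return np.arange(seq_len)          # everything fits in the window

    recent = np.arange(seq_len - window, seq_len)   # sliding window of recent tokens
    history = np.arange(seq_len - window)           # older, evictable tokens

    kept = []
    for start in range(0, history.shape[0], chunk):  # local neighborhoods
        segment = history[start:start + chunk]
        best = segment[np.argmax(attn_scores[segment])]  # locally most important token
        kept.append(best)

    return np.concatenate([np.array(kept, dtype=int), recent])

# Example: 20 tokens, a window of 8 recent tokens, chunks of 4 over the rest.
scores = np.random.rand(20)
keep = buzz_style_keep_indices(scores, window=8, chunk=4)
print(sorted(keep.tolist()))
```

Under these assumptions the cache holds roughly `window + seq_len / chunk` entries instead of `seq_len`, which is where the memory reduction comes from; the paper's beehive structure additionally organizes and updates these chunks to reach its reported $O(\log n)$ inference behavior, which this sketch does not reproduce.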
URL
https://arxiv.org/abs/2410.23079