Paper Reading AI Learner

Self-Selected Attention Span for Accelerating Large Language Model Inference

2024-04-14 19:36:04
Tian Jin, Wanzin Yazar, Zifei Xu, Sayeh Sharify, Xin Wang

Abstract

Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient because the number of tokens they must attend to grows as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate this with two specific tasks: (a) evaluating complex arithmetic expressions and (b) summarizing news articles. For both tasks, we create custom datasets to fine-tune an LLM. The goal of fine-tuning is twofold: first, to teach the LLM to solve the evaluation or summarization task, and second, to train it to identify the minimal attention span required at each step of the task. As a result, the fine-tuned model can convert these self-identified minimal attention spans into sparse attention masks on the fly during inference. We develop a custom CUDA kernel that takes advantage of the reduced context the model needs to attend to, and show that it improves the throughput of LLM inference by 28%. Our work presents an end-to-end demonstration that training LLMs to self-select their attention spans speeds up autoregressive inference on real-world tasks.
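
To make the mechanism concrete, below is a minimal illustrative sketch (Python/PyTorch) of how a model's self-selected attention span could be turned into a sparse attention mask at decode time. The control-token format "<span=K>" and the helper names parse_span_token and build_span_mask are assumptions made for this example, not the paper's actual interface.

import torch

def parse_span_token(token: str, default_span: int) -> int:
    # Assumed control-token format: the fine-tuned model emits "<span=K>" to
    # signal that the next decoding step only needs the last K context tokens.
    if token.startswith("<span=") and token.endswith(">"):
        return int(token[len("<span="):-1])
    return default_span

def build_span_mask(seq_len: int, span: int) -> torch.Tensor:
    # Boolean key mask for one decoding step: attend only to the most recent
    # `span` positions instead of all seq_len positions.
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[max(0, seq_len - span):] = True
    return mask

if __name__ == "__main__":
    seq_len = 16
    span = parse_span_token("<span=4>", default_span=seq_len)
    print(build_span_mask(seq_len, span))  # only the last 4 positions are True

In a full decoder, such a mask would be passed to scaled-dot-product attention for each newly generated token; the paper's custom CUDA kernel takes advantage of the reduced context, e.g. by not computing over positions outside the span at all rather than multiplying them by zero.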

URL

https://arxiv.org/abs/2404.09336

PDF

https://arxiv.org/pdf/2404.09336.pdf

