Paper Reading AI Learner

Self-Selected Attention Span for Accelerating Large Language Model Inference

2024-04-14 19:36:04
Tian Jin, Wanzin Yazar, Zifei Xu, Sayeh Sharify, Xin Wang

Abstract

Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient because the number of tokens they must attend to grows as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate this with two specific tasks: (a) evaluating complex arithmetic expressions and (b) summarizing news articles. For both tasks, we create custom datasets to fine-tune an LLM. The goal of fine-tuning is twofold: first, to teach the LLM to solve the evaluation or summarization task, and second, to train it to identify the minimal attention span required at each step of the task. As a result, the fine-tuned model can convert these self-identified minimal attention spans into sparse attention masks on the fly during inference. We develop a custom CUDA kernel that takes advantage of the reduced context the model needs to attend to, and show that it improves the throughput of LLM inference by 28%. Our work presents an end-to-end demonstration that training LLMs to self-select their attention spans speeds up autoregressive inference on real-world tasks.
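
To make the mechanism concrete, below is a minimal illustrative sketch (Python/PyTorch) of how a model's self-selected attention span could be turned into a sparse attention mask at decode time. The control-token format "<span=K>" and the helper names parse_span_token and build_span_mask are assumptions made for this example, not the paper's actual interface.

import torch

def parse_span_token(token: str, default_span: int) -> int:
    # Assumed control-token format: the fine-tuned model emits "<span=K>" to
    # signal that the next decoding step only needs the last K context tokens.
    if token.startswith("<span=") and token.endswith(">"):
        return int(token[len("<span="):-1])
    return default_span

def build_span_mask(seq_len: int, span: int) -> torch.Tensor:
    # Boolean key mask for one decoding step: attend only to the most recent
    # `span` positions instead of all seq_len positions.
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[max(0, seq_len - span):] = True
    return mask

if __name__ == "__main__":
    seq_len = 16
    span = parse_span_token("<span=4>", default_span=seq_len)
    print(build_span_mask(seq_len, span))  # only the last 4 positions are True

In a full decoder, such a mask would be passed to scaled-dot-product attention for each newly generated token; the paper's custom CUDA kernel takes advantage of the reduced context, e.g. by not computing over positions outside the span at all rather than multiplying them by zero.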

URL

https://arxiv.org/abs/2404.09336

PDF

https://arxiv.org/pdf/2404.09336.pdf

