MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

2024-10-30 14:53:22
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

Abstract

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate the pre-filling stage of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at this https URL.
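
To make one of the three patterns concrete, below is a minimal PyTorch sketch of the Vertical-Slash idea the abstract describes: probe attention using only the last few queries, keep the highest-scoring vertical columns and slash (diagonal) lines, and attend sparsely under the resulting mask. All function names, the probe size `last_q`, and the top-k budgets here are illustrative assumptions, not MInference's actual API; the real system builds block-level sparse indices and runs optimized GPU kernels rather than masking a dense score matrix.

```python
# Illustrative sketch only: names, probe size, and budgets are assumptions,
# not the MInference implementation.
import torch


def vertical_slash_mask(q, k, last_q=64, top_vertical=128, top_slash=128):
    """Estimate a sparse causal mask for one attention head.

    q, k: [seq_len, head_dim] tensors. Returns a bool [seq_len, seq_len]
    mask keeping only high-scoring vertical columns and slash diagonals.
    """
    n, d = q.shape
    last_q = min(last_q, n)
    idx = torch.arange(n)
    qpos = n - last_q + torch.arange(last_q)            # probe query positions

    # Cheap probe: attention of only the last `last_q` queries (last_q x n).
    probe = (q[-last_q:] @ k.T) / d ** 0.5
    probe = probe.masked_fill(idx[None, :] > qpos[:, None], float("-inf"))
    probe = probe.softmax(dim=-1)                       # masked entries -> 0

    # Vertical lines: key columns all probe queries attend to strongly.
    vert_cols = probe.sum(dim=0).topk(min(top_vertical, n)).indices

    # Slash lines: diagonals (constant query-key offset) with high total mass.
    offsets = (qpos[:, None] - idx[None, :]).clamp(min=0)
    diag_score = torch.zeros(n)
    diag_score.scatter_add_(0, offsets.reshape(-1), probe.reshape(-1))
    slash_offs = diag_score.topk(min(top_slash, n)).indices

    rel = idx[:, None] - idx[None, :]                   # query minus key pos
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, vert_cols] = True
    mask |= torch.isin(rel, slash_offs)
    mask |= rel == 0                                    # keep the main diagonal
    return mask & (rel >= 0)                            # enforce causality


def sparse_attention(q, k, v, mask):
    # Dense masking for clarity only; real kernels compute just the
    # indexed entries, which is where the pre-filling speedup comes from.
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v


# Usage on random data: the mask keeps a small fraction of the n*n entries.
torch.manual_seed(0)
n, d = 1024, 64
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
mask = vertical_slash_mask(q, k)
out = sparse_attention(q, k, v, mask)
print(out.shape, f"density={mask.float().mean().item():.3f}")
```

Note that masking a dense score matrix is still O(n^2) and only visualizes which entries the pattern keeps; the abstract's reported up-to-10x latency reduction comes from kernels that compute only the dynamically indexed sparse blocks.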

URL

https://arxiv.org/abs/2407.02490

PDF

https://arxiv.org/pdf/2407.02490.pdf

