Paper Reading AI Learner

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

2024-11-02 05:15:44
Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu

Abstract

Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, limited GPU memory largely restricts the batch size achievable in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of the attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (e.g., code generation, text summarization), GPUs (T4, A10G, H100), and LLM models (7B, 8B, 70B). NEO achieves up to 7.5$\times$, 26%, and 14% higher throughput compared to a GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to a 79.3% throughput gain on an A10G GPU.
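To make the load-aware scheduling idea concrete, here is a minimal illustrative sketch (not NEO's actual implementation; the function and parameter names are hypothetical): a scheduler greedily splits a decoding batch's per-request attention work between the GPU and the host CPU so that the two devices' estimated attention times stay balanced, which is what lets offloading raise the effective batch size without adding latency.

```python
# Hypothetical sketch of load-aware GPU/CPU attention partitioning.
# Attention cost per request is roughly proportional to its KV-cache
# length, so we balance estimated completion times across devices.

def split_batch(kv_lens, gpu_tokens_per_ms, cpu_tokens_per_ms):
    """Greedily assign each request (indexed into kv_lens) to the device
    that would finish its attention work sooner, given each device's
    assumed throughput in KV tokens per millisecond."""
    gpu, cpu = [], []
    gpu_ms = cpu_ms = 0.0
    # Longest KV caches first: they dominate attention cost, so placing
    # them first keeps the final split well balanced.
    for i in sorted(range(len(kv_lens)), key=lambda i: -kv_lens[i]):
        gpu_finish = gpu_ms + kv_lens[i] / gpu_tokens_per_ms
        cpu_finish = cpu_ms + kv_lens[i] / cpu_tokens_per_ms
        if gpu_finish <= cpu_finish:
            gpu.append(i)
            gpu_ms = gpu_finish
        else:
            cpu.append(i)
            cpu_ms = cpu_finish
    return gpu, cpu
```

In a real system the cost model would also account for PCIe transfer time and per-layer pipelining; this sketch only conveys the balancing objective.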

URL

https://arxiv.org/abs/2411.01142

PDF

https://arxiv.org/pdf/2411.01142.pdf

