Abstract
Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make inference cost-efficient when running on expensive GPU accelerators. However, limited GPU memory severely constrains the batch size achievable in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of the attention computation and KV cache state from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (e.g., code generation, text summarization), GPUs (T4, A10G, H100), and LLM models (7B, 8B, 70B). NEO achieves up to 7.5$\times$, 26%, and 14% higher throughput than a GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to a 79.3% throughput gain on the A10G GPU.
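To make the core idea concrete, below is a minimal PyTorch sketch of how a decode batch's attention might be split between GPU-resident and CPU-resident KV caches, so that only small query and output tensors cross the PCIe bus while the large KV cache for offloaded requests stays in host memory. All names, shapes, and the naive attention kernel here are illustrative assumptions, not NEO's actual implementation.

```python
# Illustrative sketch (assumed, not NEO's code): split a single decode step's
# attention across a GPU sub-batch and a CPU sub-batch whose KV cache lives
# in host memory. Only queries/outputs move across PCIe; KV caches stay put.
import torch

def attention(q, k, v):
    # q: [B, H, 1, D]; k, v: [B, H, S, D]  -- one-token decode attention
    scores = torch.matmul(q, k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)  # [B, H, 1, D]

def split_batch_attention(q, kv_gpu, kv_cpu):
    """q: [B, H, 1, D] on the GPU; kv_gpu / kv_cpu: (k, v) pairs for the two
    request sub-batches, resident on the GPU and in host memory respectively."""
    n_gpu = kv_gpu[0].shape[0]
    # GPU sub-batch: attention over the device-resident KV cache.
    out_gpu = attention(q[:n_gpu], *kv_gpu)
    # CPU sub-batch: copy only the small query tensor to the host, run
    # attention on CPU cores, and copy only the small output back.
    q_cpu = q[n_gpu:].to("cpu")
    out_cpu = attention(q_cpu, *kv_cpu).to(q.device)
    return torch.cat([out_gpu, out_cpu], dim=0)

if __name__ == "__main__":
    H, S, D = 16, 512, 64
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    q = torch.randn(8, H, 1, D, device=dev)          # 8 decoding requests
    kv_gpu = (torch.randn(5, H, S, D, device=dev),   # 5 requests kept on GPU
              torch.randn(5, H, S, D, device=dev))
    kv_cpu = (torch.randn(3, H, S, D),                # 3 requests offloaded to CPU
              torch.randn(3, H, S, D))
    print(split_batch_attention(q, kv_gpu, kv_cpu).shape)  # [8, 16, 1, 64]
```

In a real system the two sub-batches would run concurrently (the paper's asymmetric GPU-CPU pipelining) and a load-aware scheduler would decide how many requests to offload; this sketch only shows the data placement and the communication pattern.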
URL
https://arxiv.org/abs/2411.01142