Abstract
Speculative decoding has emerged as a powerful method to improve latency and throughput when hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-parameter model on a single A100 GPU with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, for an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15x speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system generates sequences with a HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches 15.8%, more than 3x the peak of regular decoding and around 10x that of single-sequence speculative decoding.
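The core mechanism behind the abstract, drafting several tokens cheaply and verifying them against the target model for every sequence in a batch, can be sketched as follows. This is a minimal greedy-acceptance sketch with toy stand-in models; the names (`draft_tokens`, `target_next`, `K`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of batched speculative decoding with greedy acceptance.
# The "models" below are toy deterministic functions standing in for a
# small draft model and a large target model (illustrative assumptions).

def draft_tokens(prefix, k):
    # Toy draft model: cheaply proposes the next k tokens.
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_next(prefix):
    # Toy target model: the token the large model would emit next.
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, k=4):
    """One draft-and-verify step: accept the longest draft prefix that the
    target model agrees with, then append one token from the target."""
    out = list(prefix)
    for tok in draft_tokens(prefix, k):
        expected = target_next(out)
        if tok == expected:
            out.append(tok)            # draft token accepted
        else:
            out.append(expected)       # first mismatch: take target's token
            break
    else:
        out.append(target_next(out))   # all k accepted: one bonus token
    return out

def batched_decode(prefixes, steps=3, k=4):
    # Each sequence in the batch runs the same draft-and-verify step.
    # A real system would verify all drafts in one batched forward pass
    # of the target model, which is where the GPU-utilization gains come from.
    for _ in range(steps):
        prefixes = [speculative_step(p, k) for p in prefixes]
    return prefixes
```

When every draft token is accepted, each step emits `k + 1` tokens per sequence at the cost of roughly one target-model forward pass, which is the source of the latency advantage the abstract reports.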
URL
https://arxiv.org/abs/2404.15778