Abstract
As (causal) large language models (LLMs) grow ever larger, inference efficiency has become a core concern alongside improved performance. Compared with the memory footprint, the latency bottleneck appears more pressing, since an LLM (e.g., GPT-4) may receive billions of requests per day. The bottleneck stems mainly from the autoregressive nature of LLMs, in which tokens can only be generated sequentially during decoding. To alleviate it, the idea of speculative execution, which originates in computer architecture, has been introduced to LLM decoding in a \textit{draft-then-verify} style. Under this regime, a sequence of tokens is first drafted quickly using some heuristics, and the drafted tokens are then verified in parallel by the LLM. Because the costly sequential inference is parallelized, LLM decoding can be significantly accelerated. Driven by the success of LLMs over the past couple of years, a growing body of literature in this direction has emerged. Yet the field lacks a position survey that summarizes the current landscape and draws a roadmap for the future development of this promising area. To meet this demand, we present the first survey that reviews and unifies the literature on speculative execution in LLMs (e.g., blockwise parallel decoding and speculative decoding) within a comprehensive framework and a systematic taxonomy. Based on the taxonomy, we present a critical review and comparative analysis of the current state of the art. Finally, we highlight key challenges and future directions for further developing the area.
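The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. It is not taken from the survey: the toy "models" below (`target_next`, `draft_next`) are hypothetical stand-ins for an LLM and a cheap drafter, and verification uses simple greedy matching, which is only one of several verification schemes. In a real system the verification step would be a single batched forward pass of the LLM; here it is emulated with a loop.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for the expensive LLM (greedy next token)."""
    return (prefix[-1] + prefix[-2]) % 10


def draft_next(prefix: List[Token]) -> Token:
    """Hypothetical cheap drafter: agrees with the target most of the time."""
    guess = (prefix[-1] + prefix[-2]) % 10
    # Deliberately wrong at every 4th position to exercise rejection.
    return (guess + 1) % 10 if len(prefix) % 4 == 0 else guess


def greedy_decode(prompt: List[Token], target: NextToken, n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[Token],
    target: NextToken,
    draft: NextToken,
    k: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Draft k tokens, verify them against the target, keep the agreed prefix.

    Returns the decoded sequence and the number of verification passes.
    """
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1) Draft: the cheap model proposes k tokens sequentially.
        drafted: List[Token] = []
        for _ in range(k):
            drafted.append(draft(out + drafted))
        # 2) Verify: target's greedy choice at each of the k + 1 positions.
        #    (One parallel pass in a real LLM; a loop in this toy.)
        verified = [target(out + drafted[:i]) for i in range(k + 1)]
        passes += 1
        # 3) Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and drafted[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(drafted[:n_ok])
        # The target's own token at the first mismatch (or a bonus token
        # if every draft was accepted) guarantees progress each pass.
        out.append(verified[n_ok])
    return out[: len(prompt) + max_new_tokens], passes


if __name__ == "__main__":
    prompt = [1, 2]
    reference = greedy_decode(prompt, target_next, 12)
    decoded, passes = speculative_decode(prompt, target_next, draft_next, k=4, max_new_tokens=12)
    print(decoded == reference, passes)
```

Under greedy verification the output matches plain greedy decoding with the target model, while the number of verification passes can be smaller than the number of generated tokens whenever the drafter agrees with the target.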
URL
https://arxiv.org/abs/2404.14897