Abstract
Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded into the GPU cache through its limited memory bandwidth for computation, yet only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer instead of computation. Recently, parallel decoding, a type of speculative decoding algorithm, has become more popular and has demonstrated impressive efficiency improvements in generation. It introduces extra decoding heads into large models, enabling them to predict multiple subsequent tokens simultaneously and verify these candidate continuations in a single decoding step. However, this approach deviates from the next-token-prediction training objective used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of the speculators and thus boosts overall efficiency. Clover transmits sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next-token prediction. The experimental results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and exceeds the previously top-performing method, Medusa, by up to 37% and 57% on the two models, respectively.
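The core idea, heads that speculate sequentially rather than independently, can be illustrated with a toy sketch. This is a hypothetical, heavily simplified stand-in (tiny linear heads, a fixed blending step in place of the paper's Attention Decoder and Augmenting Block; all names, shapes, and weights are illustrative, not the actual Clover implementation): each head emits one candidate token, and a regressive connection mixes that token's embedding into the state consumed by the next head, so later speculations condition on earlier ones.

```python
# Schematic sketch of Clover-style sequential speculation (hypothetical;
# simplified far below the real architecture for illustration only).
import random

random.seed(0)
VOCAB = 8          # toy vocabulary size
HIDDEN = 4         # toy hidden dimension
NUM_HEADS = 3      # number of speculative decoding heads

def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy parameters: a token embedding table and one linear scorer per head.
embed = [rand_vec(HIDDEN) for _ in range(VOCAB)]
head_w = [[rand_vec(HIDDEN) for _ in range(VOCAB)] for _ in range(NUM_HEADS)]

def argmax_token(state, weights):
    scores = [dot(state, w) for w in weights]
    return max(range(VOCAB), key=scores.__getitem__)

def speculate(hidden):
    """Medusa-style heads would each read only `hidden`; here a
    regressive connection (stand-in for Clover's Attention Decoder)
    also folds in the embedding of the previously speculated token."""
    tokens = []
    state = hidden
    for i in range(NUM_HEADS):
        tok = argmax_token(state, head_w[i])
        tokens.append(tok)
        # Regressive connection: blend the speculated token's embedding
        # into the state that the next head will read.
        state = [0.5 * s + 0.5 * e for s, e in zip(state, embed[tok])]
    return tokens

draft = speculate(rand_vec(HIDDEN))
print(draft)  # three candidate continuation tokens for one decoding step
```

In a real system the drafted tokens would then be verified by the base model in a single forward pass, keeping the longest accepted prefix; the sequential conditioning above is what the abstract argues raises the hit rate of that verification.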
URL
https://arxiv.org/abs/2405.00263