Paper Reading AI Learner

Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge

2024-05-01 00:46:22
Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui

Abstract

Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded into the GPU cache through its limited memory bandwidth for computation, but only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer instead of computation. Recently, parallel decoding, a class of speculative decoding algorithms, has become increasingly popular and has demonstrated impressive efficiency improvements in generation. It introduces extra decoding heads to large models, enabling them to predict multiple subsequent tokens simultaneously and verify these candidate continuations in a single decoding step. However, this approach deviates from the next-token-prediction training objective used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of the speculators and thus boosts the overall efficiency. Clover transmits the sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next-token prediction. The experimental results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and exceeds the performance of the previously top-performing method, Medusa, by up to 37% on Baichuan-Small and 57% on Baichuan-Large.
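The abstract only names Clover's three components (Regressive Connection, Attention Decoder, Augmenting Block) without specifying their internals, so the following is a minimal, hypothetical PyTorch sketch of how such a speculator could be wired up: the embedding of the previously drafted token is fed back (regressive connection), fused with an augmented hidden state via cross-attention (attention decoder), and the hidden state is first transformed to suit speculation (augmenting block). Layer types, sizes, and the class name are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only; the real Clover architecture details are not given in the abstract.
import torch
import torch.nn as nn


class CloverSpeculatorSketch(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_draft_tokens: int = 3):
        super().__init__()
        self.num_draft_tokens = num_draft_tokens
        # Augmenting Block (assumed to be a small transformer-style block):
        # re-purposes the target model's last hidden state for speculation
        # rather than plain next-token prediction.
        self.augmenting_block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True
        )
        # Regressive Connection: embeds the token drafted at step t-1 so that
        # step t can condition on it (the "sequential knowledge").
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        # Attention Decoder: fuses the pre-speculated token embedding with the
        # augmented hidden state; modeled here as a single cross-attention layer.
        self.attention_decoder = nn.MultiheadAttention(
            hidden_size, num_heads=8, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    @torch.no_grad()
    def forward(self, last_hidden: torch.Tensor, last_token: torch.Tensor) -> torch.Tensor:
        """last_hidden: (batch, hidden) final hidden state of the target model.
        last_token: (batch,) ids of the most recently accepted token."""
        context = self.augmenting_block(last_hidden.unsqueeze(1))   # (B, 1, H)
        prev_emb = self.token_embedding(last_token).unsqueeze(1)    # (B, 1, H)
        drafted = []
        for _ in range(self.num_draft_tokens):
            # Attend to the augmented context, conditioned on the embedding
            # of the previously drafted token.
            fused, _ = self.attention_decoder(prev_emb, context, context)
            logits = self.lm_head(fused.squeeze(1))                 # (B, V)
            next_tok = logits.argmax(dim=-1)                         # greedy draft
            drafted.append(next_tok)
            prev_emb = self.token_embedding(next_tok).unsqueeze(1)   # regressive link
        return torch.stack(drafted, dim=1)  # (B, num_draft_tokens) candidate tokens


# Toy usage: draft 3 candidate tokens from a dummy hidden state.
spec = CloverSpeculatorSketch(hidden_size=64, vocab_size=1000)
draft = spec(torch.randn(2, 64), torch.randint(0, 1000, (2,)))
print(draft.shape)  # torch.Size([2, 3])
```

In a full speculative decoding loop, the drafted tokens would then be verified by the target model in a single forward pass, and only the accepted prefix would be kept.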

Abstract (translated)

大语言模型(LLM)由于自回归解码的需求与大多数当代GPU设计之间的不匹配而效率低下。具体来说,计算时必须通过有限的内存带宽将数十亿到数万亿个参数加载到GPU缓存中,而实际参与计算的只是一小批token。因此,GPU大部分时间花在内存传输上而不是计算上。最近,并行解码(推测解码算法的一类)越来越受欢迎,并在生成方面展现了令人印象深刻的效率提升。它为大模型引入额外的解码头,使其能够同时预测多个后续token,并在一个解码步骤中验证这些候选延续。然而,这种方法偏离了预训练时采用的下一个token预测的训练目标,导致候选token的命中率较低。在本文中,我们提出了一种新的推测解码算法Clover,它将序列知识整合到并行解码过程中。这一增强提高了推测器的命中率,从而提升了整体效率。Clover通过回归连接(Regressive Connection)传递来自预先推测token的序列知识,然后采用注意力解码器(Attention Decoder)整合这些推测token。此外,Clover还引入了一个增强块(Augmenting Block),用于修改隐藏状态,使其更贴合推测生成而非下一个token预测的目标。实验结果表明,Clover在Baichuan-Small和Baichuan-Large上分别比基线最多提升91%和146%,并分别比此前表现最佳的方法Medusa在Baichuan-Small和Baichuan-Large上最多高出37%和57%。

URL

https://arxiv.org/abs/2405.00263

PDF

https://arxiv.org/pdf/2405.00263.pdf

