Abstract
Traditional language models operate autoregressively, i.e., they predict one token at a time. The rapid growth in model sizes has led to high inference latencies. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference time. Our models $\textit{dynamically}$ predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models that leverages the weights of their traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability, namely co-occurrence weighted masking and adaptive thresholding, to improve text generation quality. We also propose systematic qualitative and quantitative methods to rigorously test the quality of text produced by non-autoregressive generation. One model in our suite, DynaMo-7.3B-T3, generates text of the same quality as the baseline (Pythia-6.9B) while achieving a 2.57$\times$ speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively.
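To make the core idea concrete, the following is a minimal sketch of confidence-thresholded multi-token decoding. It assumes per-position probability distributions from multiple prediction heads and a simple running-product estimate of the joint probability; the function name, threshold value, and acceptance rule are illustrative assumptions, not the paper's exact algorithm (which also uses co-occurrence weighted masking and adaptive thresholding).

```python
import numpy as np

def dynamic_multi_token_predict(head_probs, threshold=0.5):
    """Greedily accept tokens from successive prediction heads while the
    running (product-estimated) joint probability stays above a confidence
    threshold. Illustrative sketch only, not the paper's exact rule."""
    accepted = []
    joint_prob = 1.0
    for probs in head_probs:            # one distribution per future position
        token = int(np.argmax(probs))   # most likely token at this position
        joint_prob *= probs[token]      # update estimated joint probability
        if joint_prob < threshold:      # not confident enough: stop here
            break
        accepted.append(token)
    # Always emit at least one token so decoding makes progress,
    # falling back to standard single-token autoregressive behaviour.
    if not accepted:
        accepted.append(int(np.argmax(head_probs[0])))
    return accepted

# Toy example: three heads over a 4-token vocabulary.
heads = [
    np.array([0.90, 0.05, 0.03, 0.02]),  # very confident
    np.array([0.70, 0.10, 0.10, 0.10]),  # joint = 0.63, still above threshold
    np.array([0.40, 0.30, 0.20, 0.10]),  # joint drops to 0.252 < 0.5
]
print(dynamic_multi_token_predict(heads, threshold=0.5))  # → [0, 0]
```

With a higher threshold the model accepts fewer tokens per step, trading speed-up for confidence; this is the dynamic behaviour the abstract refers to.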
URL
https://arxiv.org/abs/2405.00888