Abstract
Current language models (LMs) use a fixed, static subword tokenizer. This choice, often taken for granted, typically degrades efficiency and capabilities in languages other than English, and makes it challenging to apply LMs to new domains or languages. To address these issues, we propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text. For encoder-style models, we introduce a subword-merging algorithm inspired by byte-pair encoding (BPE), but applied at the batch level. We merge frequent subword sequences within a batch, then apply a pretrained embedding-prediction hypernetwork to compute the token embeddings on the fly. When applied with word-level boundaries, this reduces token sequence lengths by more than 20% on average across 14 languages on XNLI with XLM-R, while degrading task performance by less than 2%. For decoder-style models, we apply dynamic tokenization in two ways: 1) for prefilling, almost completely maintaining the performance of Mistral-7B with up to a 40% reduction in sequence length relative to word-level tokenization; and 2) via an approximate nearest-neighbor index, achieving fast generation with a one-million-token vocabulary and demonstrating scalability to even larger, dynamic vocabularies. Overall, our findings show that dynamic tokenization substantially improves inference speed and promotes fairness across languages, making a leap towards overcoming the limitations of static tokenization and enabling more equitable and adaptable LMs.
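To make the encoder-side procedure concrete, here is a minimal, self-contained Python sketch of batch-level, BPE-style merging. The function name `merge_batch`, the repeat-count stopping rule, and the string-concatenation representation of merged tokens are illustrative assumptions rather than the paper's implementation; in the paper, embeddings for the newly formed tokens are then predicted on the fly by a pretrained hypernetwork.

```python
# Minimal sketch of batch-level subword merging (helper names and stopping
# rule are illustrative assumptions; see the paper for the actual algorithm).
from collections import Counter

def merge_batch(batch_token_seqs, num_merges):
    """Greedily merge the most frequent adjacent token pair across a batch,
    BPE-style, but computed per batch instead of from a fixed training corpus."""
    seqs = [list(seq) for seq in batch_token_seqs]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs over the whole batch.
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < 2:  # stop once no pair repeats within the batch
            break
        merges.append((a, b))
        merged = a + b
        # Replace every occurrence of the pair with the merged token.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs, merges

# Toy usage: a pair that repeats within the batch becomes a single token.
seqs, merges = merge_batch(
    [["▁token", "ization"], ["▁token", "izer", "s"], ["▁token", "ization"]],
    num_merges=2,
)
# seqs -> [["▁tokenization"], ["▁token", "izer", "s"], ["▁tokenization"]]
# merges -> [("▁token", "ization")]
```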
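For the decoder-side generation result, the key idea is that scoring a one-million-token vocabulary with a full output projection is expensive, so approximate nearest-neighbor search over the output embeddings can stand in for it. Below is a hedged sketch using FAISS with toy sizes; the HNSW index type, the dimensions, and the greedy top-1 choice are assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch: approximate nearest-neighbor (ANN) next-token scoring.
# Toy sizes keep the example fast; the paper scales this idea to a
# one-million-token vocabulary.
import numpy as np
import faiss

d, vocab_size = 128, 100_000
rng = np.random.default_rng(0)
output_embeddings = rng.standard_normal((vocab_size, d)).astype("float32")

# An HNSW index under inner product approximates the usual
# logits = hidden_state @ output_embeddings.T without scanning the full vocab.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(output_embeddings)

hidden_state = rng.standard_normal((1, d)).astype("float32")
scores, token_ids = index.search(hidden_state, 10)  # top-10 candidate tokens
next_token = int(token_ids[0, 0])  # e.g., greedy pick among the candidates
```

Because only the retrieved candidates are scored, per-step decoding cost is governed by the index search rather than growing linearly with vocabulary size, which is what makes very large, dynamic vocabularies practical.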
URL
https://arxiv.org/abs/2411.18553