Paper Reading AI Learner

Retrofitting Language Models with Dynamic Tokenization

2024-11-27 17:51:58
Darius Feher, Benjamin Minixhofer, Ivan Vulić

Abstract

Current language models (LMs) use a fixed, static subword tokenizer. This choice, often taken for granted, typically results in degraded efficiency and capabilities in languages other than English, and makes it challenging to apply LMs to new domains or languages. To address these issues, we propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text. For encoder-style models, we introduce a subword-merging algorithm inspired by byte-pair encoding (BPE), but applied at the batch level: we merge frequent subword sequences within a batch, then apply a pretrained embedding-prediction hypernetwork to compute the token embeddings on the fly. When applied with word-level boundaries, this reduces token sequence lengths by more than 20% on average across 14 languages on XNLI with XLM-R while degrading task performance by less than 2%. For decoder-style models, we apply dynamic tokenization in two ways: 1) for prefilling, maintaining the performance of Mistral-7B almost completely with up to 40% sequence reduction relative to word-level tokenization; and 2) via an approximate nearest neighbor index, achieving fast generation with a one-million-token vocabulary and demonstrating scalability to even larger, dynamic vocabularies. Overall, our findings show that dynamic tokenization substantially improves inference speed and promotes fairness across languages, making a leap towards overcoming the limitations of static tokenization and enabling more equitable and adaptable LMs.
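
The batch-level merging step can be illustrated with a short sketch. This is a hypothetical simplification, not the authors' implementation: it greedily merges the most frequent adjacent subword pair within a single batch, BPE-style, representing each token as a tuple of the base subword ids it covers. In the paper, embeddings for the newly merged tokens would be predicted on the fly by a pretrained hypernetwork.

```python
from collections import Counter

def batch_level_merge(batch, num_merges):
    """Hypothetical sketch of batch-level BPE-style merging.

    batch: list of subword-id sequences, e.g. [[17, 52, 9], [52, 9, 3]].
    Returns re-tokenized sequences where each token is a tuple of the
    base subword ids it covers; merged tuples are new, batch-local
    tokens whose embeddings must be computed on the fly."""
    seqs = [[(t,) for t in seq] for seq in batch]
    for _ in range(num_merges):
        # Count adjacent token pairs over the whole batch, not a corpus.
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats within the batch; merging saves nothing
        merged = best[0] + best[1]  # concatenate the covered base ids
        # Replace every occurrence of the best pair with the merged token.
        for i, s in enumerate(seqs):
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            seqs[i] = out
    return seqs
```

For example, on `[[17, 52, 9], [52, 9, 3]]` with `num_merges=1`, the pair `((52,), (9,))` occurs twice across the batch and is merged into `(52, 9)`, shortening both sequences by one token.

For decoding with a very large dynamic vocabulary, an approximate nearest neighbor index can stand in for a full softmax over the output embeddings. Below is a minimal sketch using FAISS, assuming a hypothetical matrix of one million token embeddings and an HNSW index with inner-product scoring (the paper's exact index choice and parameters may differ):

```python
import faiss
import numpy as np

d = 4096  # hidden size (Mistral-7B uses 4096)
# Hypothetical embedding table for a one-million-token dynamic vocabulary;
# random data stands in for real token embeddings here.
vocab_embeddings = np.random.randn(1_000_000, d).astype("float32")

# HNSW index with inner-product scoring, so search approximates argmax over logits.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(vocab_embeddings)

# At each decoding step, look up the nearest vocabulary embeddings to the
# final hidden state instead of computing a one-million-way softmax.
hidden = np.random.randn(1, d).astype("float32")
scores, token_ids = index.search(hidden, k=5)  # top-5 candidate next tokens
```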

URL

https://arxiv.org/abs/2411.18553

PDF

https://arxiv.org/pdf/2411.18553.pdf

