Abstract
The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison are still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool, which classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalization than morphological tokenization for the semantic compositionality of word meanings.
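To make the morphological-vs-alien distinction concrete, here is a minimal Python sketch of the intrinsic idea: a subword tokenization is labelled morphological if every token boundary coincides with a morpheme boundary of the word, and alien otherwise. This is an illustrative assumption about the labelling criterion, not the authors' UniMorph Labeller, and the example segmentations are hypothetical rather than drawn from UniMorph data.

```python
def classify_tokenization(tokens: list[str], morphemes: list[str]) -> str:
    """Label a subword tokenization as 'morphological' or 'alien'.

    A tokenization is 'morphological' when every token boundary falls on a
    morpheme boundary of the word; otherwise it is 'alien'.
    """
    word = "".join(morphemes)
    if "".join(tokens) != word:
        raise ValueError("tokens must concatenate to the same string as morphemes")

    # Character offsets where morphemes end, e.g. un|happi|ly -> {2, 7, 9}.
    morpheme_ends = set()
    pos = 0
    for m in morphemes:
        pos += len(m)
        morpheme_ends.add(pos)

    # Every token boundary must coincide with a morpheme boundary.
    pos = 0
    for t in tokens:
        pos += len(t)
        if pos not in morpheme_ends:
            return "alien"
    return "morphological"


if __name__ == "__main__":
    # Hypothetical subword splits of "unhappily" (surface morphemes: un + happi + ly).
    print(classify_tokenization(["un", "happi", "ly"], ["un", "happi", "ly"]))   # morphological
    print(classify_tokenization(["unh", "app", "ily"], ["un", "happi", "ly"]))   # alien
```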
URL
https://arxiv.org/abs/2404.13292