Abstract
The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison are still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool, which classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalization than morphological tokenization for the semantic compositionality of word meanings.
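To make the morphological-vs-alien distinction concrete, here is a minimal Python sketch of the intrinsic idea: a subword tokenization is labelled morphological if every token boundary coincides with a morpheme boundary of the word, and alien otherwise. This is an illustrative assumption about the labelling criterion, not the authors' UniMorph Labeller, and the example segmentations are hypothetical rather than drawn from UniMorph data.

```python
def classify_tokenization(tokens: list[str], morphemes: list[str]) -> str:
    """Label a subword tokenization as 'morphological' or 'alien'.

    A tokenization is 'morphological' when every token boundary falls on a
    morpheme boundary of the word; otherwise it is 'alien'.
    """
    word = "".join(morphemes)
    if "".join(tokens) != word:
        raise ValueError("tokens must concatenate to the same string as morphemes")

    # Character offsets where morphemes end, e.g. un|happi|ly -> {2, 7, 9}.
    morpheme_ends = set()
    pos = 0
    for m in morphemes:
        pos += len(m)
        morpheme_ends.add(pos)

    # Every token boundary must coincide with a morpheme boundary.
    pos = 0
    for t in tokens:
        pos += len(t)
        if pos not in morpheme_ends:
            return "alien"
    return "morphological"


if __name__ == "__main__":
    # Hypothetical subword splits of "unhappily" (surface morphemes: un + happi + ly).
    print(classify_tokenization(["un", "happi", "ly"], ["un", "happi", "ly"]))   # morphological
    print(classify_tokenization(["unh", "app", "ily"], ["un", "happi", "ly"]))   # alien
```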
URL
https://arxiv.org/abs/2404.13292