CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

2021-03-11 18:57:44

Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

arXiv_CL

arXiv_CL QA Bert Transformer

Abstract

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences--without explicit tokenization or vocabulary--and a pre-training strategy with soft inductive biases in place of hard token boundaries. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by >=1 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
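
The two input-side ideas in the abstract, vocabulary-free character input and downsampling to shorten the sequence, can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `to_codepoints` and `downsample`, the rate of 4, and the use of plain averaging are all assumptions made for illustration. A real encoder would pool learned character embeddings with a learned operation rather than average raw integers.

```python
def to_codepoints(text):
    """Map each character to its Unicode codepoint; no vocabulary lookup is needed."""
    return [ord(ch) for ch in text]


def downsample(values, rate):
    """Average each consecutive window of `rate` positions.

    Illustrative stand-in for a learned downsampling layer; the final
    window may be shorter than `rate`.
    """
    windows = [values[i:i + rate] for i in range(0, len(values), rate)]
    return [sum(w) / len(w) for w in windows]


# Hypothetical usage: the rate of 4 is an assumed value, not taken from the abstract.
codepoints = to_codepoints("tokenization-free")
shortened = downsample(codepoints, rate=4)
print(len(codepoints), len(shortened))
```

The shortened sequence is what a deep transformer stack would then process, which is where the efficiency gain described in the abstract would come from.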

URL

https://arxiv.org/abs/2103.06874

PDF

https://arxiv.org/pdf/2103.06874.pdf