Abstract
We address the design of a unified multilingual system for handwriting recognition. Most multilingual systems rest on specialized models, each trained on a single language, one of which is selected at test time. While some recognition systems are based on a unified optical model, building a unified language model remains a major issue, because traditional language models are generally trained on corpora with large per-language word lexicons. Here, we propose a solution based on language models of sub-lexical units, called multigrams. Working with multigrams strongly reduces the lexicon size and thus decreases the language model complexity. This makes it possible to design an end-to-end unified multilingual recognition system in which both a single optical model and a single language model are trained on all the languages. We discuss the impact of language unification on each model and show that our system reaches the performance of state-of-the-art methods with a strong reduction in complexity.
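The claim that sub-lexical units shrink the lexicon can be illustrated with a small sketch. It is not the segmentation method of the paper: the fixed-length chunking in split_into_units, the SYLLABLES list, and the synthetic vocabulary are all assumptions made for illustration. The point is only that a word lexicon grows combinatorially with the number of ways units can be combined, while the unit lexicon stays small.

```python
from itertools import product

# Hypothetical two-character units; not taken from the paper.
SYLLABLES = ["na", "ti", "on", "re", "la", "st"]


def split_into_units(word, max_len=2):
    """Cut a word into consecutive chunks of at most max_len characters.

    This naive chunking is a stand-in for multigram segmentation,
    chosen only to keep the example short.
    """
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def lexicon_sizes(words, max_len=2):
    """Return (number of distinct words, number of distinct sub-lexical units)."""
    word_lexicon = set(words)
    unit_lexicon = {
        unit
        for word in word_lexicon
        for unit in split_into_units(word, max_len)
    }
    return len(word_lexicon), len(unit_lexicon)


# Synthetic vocabulary: every concatenation of three units.
toy_words = ["".join(parts) for parts in product(SYLLABLES, repeat=3)]

n_words, n_units = lexicon_sizes(toy_words)
print(f"word lexicon: {n_words} entries, unit lexicon: {n_units} entries")
```

On this synthetic vocabulary the word lexicon has 216 entries while the unit lexicon has 6. Real text is far less regular, so the reduction obtained in practice depends on the corpus and on how the units are learned.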
URL
https://arxiv.org/abs/1808.09183