Abstract
Many studies on (Offline) Handwritten Text Recognition (HTR) systems have focused on building state-of-the-art models for line recognition on small corpora. However, adding HTR capability to a large-scale multilingual OCR system poses new challenges. This paper addresses three problems in building such systems: data, efficiency, and integration. First, one of the biggest challenges is obtaining sufficient amounts of high-quality training data. We address this problem by using online handwriting data collected for a large-scale production online handwriting recognition system. We describe our image data generation pipeline and study how online data can be used to build HTR models. We show that the data improve the models significantly when only a small number of real images is available, which is usually the case for HTR models. This enables us to support a new script at substantially lower cost. Second, we propose a line recognition model based on neural networks without recurrent connections. The model achieves accuracy comparable to that of LSTM-based models while allowing for better parallelism in training and inference. Finally, we present a simple way to integrate HTR models into an OCR system. Together, these constitute a solution for bringing HTR capability into a large-scale OCR system.
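The abstract does not detail the image data generation pipeline, so the following is only a minimal sketch of its basic idea: turning online handwriting (sequences of pen coordinates) into an offline line image. The function name `render_ink`, the fixed image height, the padding, and the binary output are all assumptions made for illustration; a real pipeline would likely also vary stroke width, add backgrounds, and apply degradations.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]


def render_ink(strokes: List[Stroke], height: int = 32, pad: int = 2) -> List[List[int]]:
    """Rasterize pen strokes into a binary image (1 = ink) of fixed height.

    Hypothetical helper: assumes at least one non-empty stroke and
    height > 2 * pad + 1. The aspect ratio of the ink is preserved.
    """
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    min_x, min_y = min(xs), min(ys)
    span_y = max(ys) - min_y

    # Scale the ink so its vertical extent fills the image minus padding.
    scale = (height - 1 - 2 * pad) / span_y if span_y > 0 else 1.0
    width = int(round((max(xs) - min_x) * scale)) + 1 + 2 * pad
    image = [[0] * width for _ in range(height)]

    def to_pixel(p: Point) -> Tuple[int, int]:
        return (int(round((p[0] - min_x) * scale)) + pad,
                int(round((p[1] - min_y) * scale)) + pad)

    for stroke in strokes:
        pixels = [to_pixel(p) for p in stroke]
        if not pixels:
            continue
        # A single-point stroke (a dot) becomes a zero-length segment.
        segments = list(zip(pixels, pixels[1:])) or [(pixels[0], pixels[0])]
        for (x0, y0), (x1, y1) in segments:
            # Linear interpolation, one sample per pixel along the longer axis.
            steps = max(abs(x1 - x0), abs(y1 - y0), 1)
            for i in range(steps + 1):
                x = x0 + round((x1 - x0) * i / steps)
                y = y0 + round((y1 - y0) * i / steps)
                image[y][x] = 1
    return image


# Usage: a "T" shape drawn with two strokes.
t_shape = [[(0, 0), (9, 0)], [(4, 0), (4, 9)]]
img = render_ink(t_shape, height=14)
print("\n".join("".join("#" if v else "." for v in row) for row in img))
```

Scaling to a fixed height mirrors the common practice of feeding height-normalized line images to a line recognizer; whether the paper's pipeline normalizes this way is not stated in the abstract.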
URL
https://arxiv.org/abs/1904.09150