Abstract
The Transformer has quickly become the dominant architecture for many pattern recognition tasks thanks to its capacity to model long-range dependencies. However, transformers are data-hungry models and need large datasets for training. In Handwritten Text Recognition (HTR), collecting large amounts of labeled data is complicated and expensive. In this paper, we propose a lite transformer architecture for full-page, multi-script handwriting recognition. The proposed model offers three advantages. First, to address the common problem of data scarcity, the lite transformer can be trained on a moderate amount of data, as is the case for most public HTR datasets, without the need for external data. Second, it can learn the reading order at the page level thanks to a curriculum learning strategy, allowing it to avoid line segmentation errors, exploit a larger context, and reduce the need for costly segmentation annotations. Third, it can easily be adapted to other scripts through a simple transfer-learning process using only page-level labeled images. Extensive experiments on datasets covering different scripts (French, English, Spanish, and Arabic) show the effectiveness of the proposed model.
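The curriculum learning strategy mentioned above can be pictured as training on progressively larger units (lines, then paragraphs, then full pages) so the model gradually learns the page-level reading order. A minimal sketch of such a schedule follows; the stage names, epoch boundaries, and helper functions are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical curriculum schedule: train on lines first, then paragraphs,
# then full pages. Stage boundaries are assumed for illustration only.
def curriculum_stage(epoch, boundaries=(10, 20)):
    """Return the training granularity for the given epoch."""
    if epoch < boundaries[0]:
        return "line"
    if epoch < boundaries[1]:
        return "paragraph"
    return "page"

def select_samples(dataset, stage):
    """Keep only samples at the current curriculum granularity."""
    return [s for s in dataset if s["granularity"] == stage]

# Toy dataset: one sample per granularity level.
dataset = [
    {"granularity": "line", "image": "line_01.png", "text": "bonjour"},
    {"granularity": "paragraph", "image": "par_01.png", "text": "bonjour le monde"},
    {"granularity": "page", "image": "page_01.png", "text": "bonjour le monde\nencore"},
]

# At each epoch, the training batch is drawn from the current stage.
for epoch in (0, 12, 25):
    stage = curriculum_stage(epoch)
    batch = select_samples(dataset, stage)
```

The key design point is that the model and loss stay unchanged across stages; only the granularity of the training targets grows, which is what lets a page-level decoder learn reading order without ever needing line-segmentation labels at inference time.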
URL
https://arxiv.org/abs/2303.13931