MSdocTr-Lite: A Lite Transformer for Full Page Multi-script Handwriting Recognition

2023-03-24 11:40:50
Marwa Dhiaf, Ahmed Cheikh Rouhou, Yousri Kessentini, Sinda Ben Salem

Abstract

The Transformer has quickly become the dominant architecture for various pattern recognition tasks thanks to its capacity for long-range representation. However, transformers are data-hungry models that need large datasets for training. In Handwritten Text Recognition (HTR), collecting a massive amount of labeled data is a complicated and expensive task. In this paper, we propose a lite transformer architecture for full-page multi-script handwriting recognition. The proposed model offers three advantages. First, to address the common problem of data scarcity, the lite transformer can be trained on a reasonable amount of data, which is the case for most public HTR datasets, without the need for external data. Second, it can learn the reading order at page level thanks to a curriculum learning strategy, allowing it to avoid line-segmentation errors, exploit a larger context, and reduce the need for costly segmentation annotations. Third, it can be easily adapted to other scripts through a simple transfer-learning process using only page-level labeled images. Extensive experiments on datasets covering different scripts (French, English, Spanish, and Arabic) demonstrate the effectiveness of the proposed model.
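As a concrete illustration of the idea, the sketch below shows a minimal "lite" page-level HTR model in PyTorch: a small CNN backbone feeding a shallow transformer encoder-decoder that emits the transcription character by character, with a causal decoder mask standing in for the learned reading order. This is not the authors' code; the backbone, the two-layer encoder/decoder depth, and all hyperparameters (d_model=256, nhead=4, etc.) are illustrative assumptions.

import torch
import torch.nn as nn

class LitePageHTR(nn.Module):
    """Hypothetical lite transformer for page-level HTR (illustrative only)."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2, max_len=1024):
        super().__init__()
        # Small CNN backbone: three stride-2 convs downsample the page image
        # into a grid of visual feature vectors (kept shallow to stay "lite").
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Shallow encoder/decoder stacks keep the parameter count small enough
        # to train on modest HTR datasets without external data.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024,
                                       batch_first=True),
            num_layers=num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=1024,
                                       batch_first=True),
            num_layers=num_layers)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)  # decoder positions
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, page, tgt_tokens):
        # page: (B, 1, H, W) grayscale page image
        # tgt_tokens: (B, T) previous character ids (teacher forcing)
        feats = self.backbone(page)                # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C)
        # NOTE: a 2D positional encoding of the feature grid is omitted
        # here for brevity.
        memory = self.encoder(memory)
        t = tgt_tokens.size(1)
        pos = torch.arange(t, device=tgt_tokens.device)
        tgt = self.tok_embed(tgt_tokens) + self.pos_embed(pos)
        # Causal mask: each position attends only to earlier characters, so
        # the decoder must learn the page-level reading order itself.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=page.device),
                          diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                      # (B, T, vocab_size) logits

# Shape check: a 256x256 page yields a 32x32 feature grid (1024 memory tokens).
model = LitePageHTR(vocab_size=100)
logits = model(torch.randn(2, 1, 256, 256), torch.randint(0, 100, (2, 50)))

Under the curriculum strategy the abstract describes, such a model would plausibly be trained first on easier sub-page units (e.g., single lines or paragraphs) before full pages, and adapted to a new script by fine-tuning on page-level labels only; how the curriculum stages and the transferred layers are actually defined is a detail of the paper, not of this sketch.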

URL

https://arxiv.org/abs/2303.13931

PDF

https://arxiv.org/pdf/2303.13931.pdf

