Paper Reading AI Learner

Self-supervised Pre-training of Text Recognizers

2024-05-01 09:58:57
Martin Kišš, Michal Hradiš

Abstract

In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but annotating them is costly. Therefore, methods that utilize unlabeled data are actively researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches -- Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent model collapse, in which the model relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets, mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target-domain data. We use transfer learning as a strong baseline. The evaluation shows that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first studies exploring self-supervised pre-training in document text recognition, and we believe it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at this https URL.
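The abstract mentions the NT-Xent (normalized temperature-scaled cross-entropy) objective used for the joint-embedding pre-training. As a rough reference only, and not the paper's implementation, a minimal NumPy sketch of that loss might look like this, where `z1` and `z2` are embeddings of two views of the same batch of text-line images:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent loss sketch (assumed names, not the paper's code).

    z1, z2: (N, D) arrays of embeddings of two augmented views of the
    same N inputs; row i of z1 and row i of z2 form a positive pair.
    """
    z = np.concatenate([z1, z2], axis=0)                  # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # unit-normalize
    sim = z @ z.T / temperature                           # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                        # exclude self-pairs
    n = len(z1)
    # The positive for row i is its other view: i + n (and i - n for the
    # second half), i.e. targets = [n..2n-1, 0..n-1].
    targets = np.concatenate([np.arange(n) + n, np.arange(n)])
    # Cross-entropy over each row's similarities via log-softmax.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return np.mean(logsumexp - sim[np.arange(2 * n), targets])
```

Closely aligned view pairs should yield a lower loss than unrelated pairs, which is the signal the joint-embedding pre-training optimizes.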

Abstract (translated)

In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but annotating them is costly. Therefore, we study methods that utilize unlabeled data. We investigate self-supervised pre-training methods based on masked label prediction using three different approaches: Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent model collapse, in which the model relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets, mainly to investigate the benefits of self-supervised pre-training with different amounts of annotated target-domain data. We use transfer learning as a strong baseline. The evaluation shows that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first studies exploring self-supervised pre-training in document text recognition, and we believe it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at this https URL.

URL

https://arxiv.org/abs/2405.00420

PDF

https://arxiv.org/pdf/2405.00420.pdf

