Paper Reading AI Learner

Self-supervised Pre-training of Text Recognizers

2024-05-01 09:58:57
Martin Kišš, Michal Hradiš

Abstract

In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but annotating them is costly. Therefore, methods that utilize unlabeled data are actively researched. We study self-supervised pre-training methods based on masked label prediction using three different approaches -- Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent model collapse, in which the model relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets, mainly to investigate the benefits of the self-supervised pre-training techniques with different amounts of annotated target-domain data. We use transfer learning as a strong baseline. The evaluation shows that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first studies exploring self-supervised pre-training in document text recognition, and we believe it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at this https URL.
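The abstract mentions the NT-Xent (normalized temperature-scaled cross-entropy) objective used for the joint-embedding pre-training. As a rough reference only, and not the paper's implementation, a minimal NumPy sketch of that loss might look like this, where `z1` and `z2` are embeddings of two views of the same batch of text-line images:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal NT-Xent loss sketch (assumed names, not the paper's code).

    z1, z2: (N, D) arrays of embeddings of two augmented views of the
    same N inputs; row i of z1 and row i of z2 form a positive pair.
    """
    z = np.concatenate([z1, z2], axis=0)                  # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # unit-normalize
    sim = z @ z.T / temperature                           # scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                        # exclude self-pairs
    n = len(z1)
    # The positive for row i is its other view: i + n (and i - n for the
    # second half), i.e. targets = [n..2n-1, 0..n-1].
    targets = np.concatenate([np.arange(n) + n, np.arange(n)])
    # Cross-entropy over each row's similarities via log-softmax.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return np.mean(logsumexp - sim[np.arange(2 * n), targets])
```

Closely aligned view pairs should yield a lower loss than unrelated pairs, which is the signal the joint-embedding pre-training optimizes.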

Abstract (translated)

In this paper, we investigate self-supervised pre-training methods for document text recognition. Nowadays, large unlabeled datasets can be collected for many research tasks, including text recognition, but annotating them is costly. Therefore, we study methods that utilize unlabeled data. We investigate self-supervised pre-training methods based on masked label prediction using three different approaches: Feature Quantization, VQ-VAE, and Post-Quantized AE. We also investigate joint-embedding approaches with VICReg and NT-Xent objectives, for which we propose an image shifting technique to prevent model collapse, in which the model relies solely on positional encoding while completely ignoring the input image. We perform our experiments on historical handwritten (Bentham) and historical printed datasets, mainly to investigate the benefits of self-supervised pre-training with different amounts of annotated target-domain data. We use transfer learning as a strong baseline. The evaluation shows that self-supervised pre-training on data from the target domain is very effective, but it struggles to outperform transfer learning from closely related domains. This paper is one of the first studies exploring self-supervised pre-training in document text recognition, and we believe it will become a cornerstone for future research in this area. We made our implementation of the investigated methods publicly available at this https URL.

URL

https://arxiv.org/abs/2405.00420

PDF

https://arxiv.org/pdf/2405.00420.pdf

