MSdocTr-Lite: A Lite Transformer for Full Page Multi-script Handwriting Recognition

2023-03-24 11:40:50
Marwa Dhiaf, Ahmed Cheikh Rouhou, Yousri Kessentini, Sinda Ben Salem

Abstract

The Transformer has quickly become the dominant architecture for various pattern recognition tasks thanks to its capacity for long-range representation. However, transformers are data-hungry models that need large datasets for training. In Handwritten Text Recognition (HTR), collecting a massive amount of labeled data is a complicated and expensive task. In this paper, we propose a lite transformer architecture for full-page multi-script handwriting recognition. The proposed model offers three advantages. First, to address the common problem of data scarcity, the lite transformer can be trained on a reasonable amount of data, which is the case for most public HTR datasets, without the need for external data. Second, it can learn the reading order at page level thanks to a curriculum learning strategy, allowing it to avoid line-segmentation errors, exploit a larger context, and reduce the need for costly segmentation annotations. Third, it can be easily adapted to other scripts through a simple transfer-learning process using only page-level labeled images. Extensive experiments on datasets covering different scripts (French, English, Spanish, and Arabic) demonstrate the effectiveness of the proposed model.
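As a concrete illustration of the idea, the sketch below shows a minimal "lite" page-level HTR model in PyTorch: a small CNN backbone feeding a shallow transformer encoder-decoder that emits the transcription character by character, with a causal decoder mask standing in for the learned reading order. This is not the authors' code; the backbone, the two-layer encoder/decoder depth, and all hyperparameters (d_model=256, nhead=4, etc.) are illustrative assumptions.

import torch
import torch.nn as nn

class LitePageHTR(nn.Module):
    """Hypothetical lite transformer for page-level HTR (illustrative only)."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2, max_len=1024):
        super().__init__()
        # Small CNN backbone: three stride-2 convs downsample the page image
        # into a grid of visual feature vectors (kept shallow to stay "lite").
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Shallow encoder/decoder stacks keep the parameter count small enough
        # to train on modest HTR datasets without external data.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024,
                                       batch_first=True),
            num_layers=num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=1024,
                                       batch_first=True),
            num_layers=num_layers)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)  # decoder positions
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, page, tgt_tokens):
        # page: (B, 1, H, W) grayscale page image
        # tgt_tokens: (B, T) previous character ids (teacher forcing)
        feats = self.backbone(page)                # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C)
        # NOTE: a 2D positional encoding of the feature grid is omitted
        # here for brevity.
        memory = self.encoder(memory)
        t = tgt_tokens.size(1)
        pos = torch.arange(t, device=tgt_tokens.device)
        tgt = self.tok_embed(tgt_tokens) + self.pos_embed(pos)
        # Causal mask: each position attends only to earlier characters, so
        # the decoder must learn the page-level reading order itself.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=page.device),
                          diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                      # (B, T, vocab_size) logits

# Shape check: a 256x256 page yields a 32x32 feature grid (1024 memory tokens).
model = LitePageHTR(vocab_size=100)
logits = model(torch.randn(2, 1, 256, 256), torch.randint(0, 100, (2, 50)))

Under the curriculum strategy the abstract describes, such a model would plausibly be trained first on easier sub-page units (e.g., single lines or paragraphs) before full pages, and adapted to a new script by fine-tuning on page-level labels only; how the curriculum stages and the transferred layers are actually defined is a detail of the paper, not of this sketch.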

URL

https://arxiv.org/abs/2303.13931

PDF

https://arxiv.org/pdf/2303.13931.pdf

