Best Practices for a Handwritten Text Recognition System

2024-04-17 13:00:05
George Retsinas, Giorgos Sfikas, Basilis Gatos, Christophoros Nikou

Abstract

Handwritten text recognition has developed rapidly in recent years, following the rise of deep learning and its applications. Although deep learning methods provide a notable boost in text recognition performance, non-trivial deviations in performance can be observed even when small pre-processing or architectural/optimization elements are changed. This work follows a "best practice" rationale: it highlights simple yet effective empirical practices that can further assist training and yield well-performing handwritten text recognition systems. Specifically, we considered three basic aspects of a deep HTR system and proposed simple yet effective solutions: 1) retain the aspect ratio of the images in the preprocessing step, 2) use max-pooling to convert the 3D feature map of the CNN output into a sequence of features, and 3) assist the training procedure via an additional CTC loss that acts as a shortcut on the max-pooled sequential features. With these simple modifications, one can attain close to state-of-the-art results on both the IAM and RIMES datasets while using a basic convolutional-recurrent (CNN+LSTM) architecture. Code is available at this https URL.
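
Since the three practices are concrete preprocessing and architectural choices, a short illustration may help. Below is a minimal PyTorch sketch of all three; it is not the authors' implementation, and the toy backbone, layer sizes, and tensor shapes are illustrative assumptions (the actual code is in the linked repository).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def resize_keep_aspect(img, target_h=64, target_w=1024):
        # Practice 1: rescale a (1, H, W) line image to a fixed height while
        # preserving the aspect ratio, then pad (rather than stretch) to the
        # fixed width. The background value 0 is an assumption.
        _, h, w = img.shape
        new_w = min(target_w, max(1, round(w * target_h / h)))
        img = F.interpolate(img[None], size=(target_h, new_w),
                            mode='bilinear', align_corners=False)[0]
        return F.pad(img, (0, target_w - new_w, 0, 0))

    class HTRNet(nn.Module):
        def __init__(self, nclasses, cnn_out=256, hidden=256):
            super().__init__()
            # Toy backbone; the paper uses a deeper ResNet-style CNN.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 64, 3, 2, 1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(),
                nn.Conv2d(128, cnn_out, 3, (2, 1), 1), nn.ReLU())
            self.rnn = nn.LSTM(cnn_out, hidden, num_layers=2, bidirectional=True)
            self.head = nn.Linear(2 * hidden, nclasses)
            # Practice 3: an extra classifier straight on the pooled CNN
            # features, acting as a CTC "shortcut" around the LSTM.
            self.aux_head = nn.Linear(cnn_out, nclasses)

        def forward(self, x):                    # x: (B, 1, H, W)
            f = self.cnn(x)                      # (B, C, H', W')
            # Practice 2: collapse the vertical axis with max-pooling instead
            # of flattening, yielding a width-aligned feature sequence.
            seq = f.max(dim=2).values.permute(2, 0, 1)   # (W', B, C)
            main = self.head(self.rnn(seq)[0])   # (W', B, nclasses)
            aux = self.aux_head(seq)             # shortcut branch
            return main.log_softmax(-1), aux.log_softmax(-1)

During training, both outputs would feed the same CTC criterion, for example (model, images, targets, and target_lens are assumed to come from the training loop, and the 0.1 auxiliary weight is an assumption, not the paper's value):

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    main, aux = model(images)
    in_lens = torch.full((images.size(0),), main.size(0), dtype=torch.long)
    loss = ctc(main, targets, in_lens, target_lens) \
         + 0.1 * ctc(aux, targets, in_lens, target_lens)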

URL

https://arxiv.org/abs/2404.11339

PDF

https://arxiv.org/pdf/2404.11339.pdf

