Abstract
Handwritten text recognition (HTR) has developed rapidly in recent years, following the rise of deep learning and its applications. Although deep learning methods provide a notable boost in text recognition performance, non-trivial deviations in performance can appear even when small preprocessing or architectural/optimization elements are changed. This work follows a ``best practice'' rationale: it highlights simple yet effective empirical practices that can aid training and yield well-performing handwritten text recognition systems. Specifically, we consider three basic aspects of a deep HTR system and propose simple yet effective solutions: 1) retain the aspect ratio of the images in the preprocessing step, 2) use max-pooling to convert the 3D feature map of the CNN output into a sequence of features, and 3) assist the training procedure via an additional CTC loss that acts as a shortcut on the max-pooled sequential features. With these simple modifications, one can attain close to state-of-the-art results with a basic convolutional-recurrent (CNN+LSTM) architecture on both the IAM and RIMES datasets. Code is available at this https URL.
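The first two practices can be sketched in plain NumPy. This is a hedged illustration under my own assumptions, not the paper's implementation: the function names `feature_map_to_sequence` and `resize_keep_aspect` are hypothetical, and a real system would use proper image interpolation rather than nearest-neighbour sampling.

```python
import numpy as np

def feature_map_to_sequence(fmap):
    """Practice (2): collapse a CNN feature map of shape (C, H, W) into a
    sequence of W feature vectors by max-pooling over the height axis."""
    pooled = fmap.max(axis=1)   # (C, W): max over H for each channel/column
    return pooled.T             # (W, C): one feature vector per time step

def resize_keep_aspect(img, target_h, max_w):
    """Practice (1): resize a grayscale image (H, W) to a fixed height while
    retaining the aspect ratio, padding the width instead of stretching.
    Nearest-neighbour sampling keeps the sketch dependency-free."""
    h, w = img.shape
    new_w = min(max_w, max(1, round(w * target_h / h)))
    rows = (np.arange(target_h) * h / target_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = img[np.ix_(rows, cols)]
    out = np.zeros((target_h, max_w), dtype=img.dtype)
    out[:, :new_w] = resized    # left-align content, zero-pad the rest
    return out
```

The sequence produced by `feature_map_to_sequence` is what would feed both the LSTM and, per practice (3), an auxiliary CTC loss branching off before the recurrent layers.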
URL
https://arxiv.org/abs/2404.11339