Abstract
Offline Handwritten Text Recognition (HTR) systems play a crucial role in applications such as historical document digitization, automatic form processing, and biometric authentication. However, their performance is often limited by the scarcity of annotated training data, particularly for low-resource languages and complex scripts. This paper presents a comprehensive survey of offline handwritten data augmentation and generation techniques designed to improve the accuracy and robustness of HTR systems. We systematically examine traditional augmentation methods alongside recent advances in deep learning, including Generative Adversarial Networks (GANs), diffusion models, and transformer-based approaches. We also explore the challenges of generating diverse and realistic handwriting samples, particularly preserving script authenticity and addressing data scarcity. The survey follows the PRISMA methodology to ensure a structured and rigorous selection process. Our search of key academic sources, including IEEE Digital Library, SpringerLink, ScienceDirect, and ACM Digital Library, initially identified 1,302 primary studies, which were reduced to 848 after removing duplicates. By evaluating existing datasets, assessment metrics, and state-of-the-art methodologies, this survey identifies key research gaps and proposes future directions to advance handwritten text generation across diverse linguistic and stylistic landscapes.
URL
https://arxiv.org/abs/2507.06275