Abstract
We implemented a high-performance optical character recognition (OCR) model for classical handwritten documents using data augmentation based on highly variable cropping within the document region. OCR of handwritten documents, especially classical documents, has long been a difficult problem for researchers and organizations in many countries. Despite extensive prior work, the degradation of classical texts over time and the distinctive writing styles of individual authors keep recognition hard. Handwritten hanja documents pose a meaningful and distinct challenge, because hanja developed to reflect the vocabulary, semantic, and syntactic features of the Joseon Dynasty and therefore differs from classical Chinese characters. To study this challenge, we used 1,100 cursive documents, which are small in size, and generated 100 augmented samples from each document by cropping a randomly sized region within it. We trained a two-stage object detection model, High-resolution neural network (HRNet), on these samples, and the resulting model achieved a recognition rate of 90% at inference on cursive documents. We also confirmed that OCR performance is affected by simplified characters, variant characters, common characters, and alternate forms of Chinese characters, which are rarely examined in other studies. We propose that these results can be applied to OCR of modern documents in multiple languages as well as to other typefaces in classical documents.
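The random-crop augmentation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the crop-size range (`min_frac`, `max_frac`), the rule of keeping only character boxes that lie fully inside the crop, and all function names are assumptions made for illustration. Only the figure of 100 crops per document comes from the abstract. The sketch computes crop windows and remapped box coordinates; cutting the actual pixels would be done with whatever image library is in use.

```python
import random
from typing import List, Tuple

# Hypothetical annotation format: (x_min, y_min, x_max, y_max, label).
Box = Tuple[int, int, int, int, str]
Window = Tuple[int, int, int, int]


def random_crop(width: int, height: int, boxes: List[Box], rng: random.Random,
                min_frac: float = 0.3, max_frac: float = 1.0) -> Tuple[Window, List[Box]]:
    """Sample one randomly sized crop window and remap the boxes inside it."""
    # Assumed size range: each side is 30% to 100% of the document side.
    crop_w = max(1, int(width * rng.uniform(min_frac, max_frac)))
    crop_h = max(1, int(height * rng.uniform(min_frac, max_frac)))
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    right, bottom = left + crop_w, top + crop_h

    kept: List[Box] = []
    for x0, y0, x1, y1, label in boxes:
        # Assumed rule: keep only characters entirely inside the window,
        # then shift them into the crop's own coordinate system.
        if x0 >= left and y0 >= top and x1 <= right and y1 <= bottom:
            kept.append((x0 - left, y0 - top, x1 - left, y1 - top, label))
    return (left, top, right, bottom), kept


def augment_document(width: int, height: int, boxes: List[Box],
                     n_crops: int = 100, seed: int = 0) -> List[Tuple[Window, List[Box]]]:
    """Produce n_crops randomly sized crops (100 per document, as in the abstract)."""
    rng = random.Random(seed)
    return [random_crop(width, height, boxes, rng) for _ in range(n_crops)]


if __name__ == "__main__":
    # Toy document of 800 x 1200 pixels with two annotated characters.
    toy_boxes = [(10, 10, 50, 50, "A"), (400, 600, 450, 660, "B")]
    crops = augment_document(800, 1200, toy_boxes)
    print(len(crops), "crops generated")
```

Because the window size varies from crop to crop, the same character appears at different relative scales and positions across the 100 samples, which is the source of the variability the abstract refers to.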
URL
https://arxiv.org/abs/2412.10647