Data Generation for Post-OCR correction of Cyrillic handwriting

Abstract

This paper introduces a novel approach to post-Optical Character Recognition correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research. The gap stems from the lack of large text corpora containing OCR errors, which are needed to train language-based POC models that require large amounts of data. Our study focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. The engine generates highly realistic handwritten text in any quantity, and we use it to create a substantial dataset by rendering Russian text corpora sourced from the internet as handwriting. We then apply a Handwritten Text Recognition (HTR) model to this dataset to collect OCR errors, which form the basis for training our POC model. The correction model is trained on a 90-symbol input context, using a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on the HWR200 and School_notebooks_RU datasets, as both pose significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers evaluating student performance: comparing a sentence before and after correction and displaying the differences is sufficient. Our primary contribution lies in the use of Bézier curves for Cyrillic text generation and the subsequent error correction using a specialized POC model. We validate our approach by reporting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR), both with and without post-OCR correction, on real open corpora of handwritten Cyrillic text. These results, together with our methodology, are designed to be reproducible, paving the way for further advances in OCR and handwritten text analysis. Paper contributions can be found in this https URL
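
The error-highlighting use case mentioned in the abstract (comparing a sentence before and after correction and displaying the differences) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name highlight_corrections, the word-level granularity, and the sample sentence are assumptions made for illustration, and the comparison uses Python's standard difflib module rather than anything described in the paper.

```python
import difflib


def highlight_corrections(before: str, after: str) -> list[tuple[str, str, str]]:
    """Return (operation, words_before, words_after) for every changed span.

    Hypothetical helper: `before` is the raw HTR output, `after` is the
    text produced by a post-OCR correction model. Word-level comparison
    is an assumption; the paper does not specify the granularity.
    """
    src = before.split()
    dst = after.split()
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, deleted, or inserted words
            changes.append((tag, " ".join(src[i1:i2]), " ".join(dst[j1:j2])))
    return changes


if __name__ == "__main__":
    # Illustrative input: the recognizer misread one letter in the last word.
    for op, old, new in highlight_corrections("мама мыла раиу", "мама мыла раму"):
        print(f"{op}: '{old}' -> '{new}'")
```

Each returned tuple marks a span a teacher might want to inspect. Whether a difference reflects a recognition error or a genuine student mistake cannot be decided by the comparison alone.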

URL

https://arxiv.org/abs/2311.15896

PDF

https://arxiv.org/pdf/2311.15896.pdf
