Paper Reading AI Learner

Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

2018-09-14 16:52:12
Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter

Abstract

In this paper we describe a dataset of German and Latin ground truth (GT) for historical OCR, in the form of printed text line images paired with their transcriptions. The dataset, called GT4HistOCR, consists of 313,173 line pairs spanning a wide range of printing dates, from 15th-century incunabula to 19th-century books printed in Fraktur types, and is openly available under a CC-BY 4.0 license. Because the GT takes the form of line image/transcription pairs, it is directly usable for training state-of-the-art recognition models for OCR software that employs recurrent neural networks with an LSTM architecture, such as Tesseract 4 or OCRopus. We also provide pretrained OCRopus models for subcorpora of our dataset, yielding character accuracy rates between 95% (early printings) and 98% (19th-century Fraktur printings) on unseen test cases, as well as a Perl script to harmonize GT produced under different transcription rules, and we give hints on how to construct GT for OCR purposes, whose requirements may differ from those of linguistically motivated transcriptions.

URL

https://arxiv.org/abs/1809.05501

PDF

https://arxiv.org/pdf/1809.05501.pdf
