Abstract
In this paper we describe a dataset of German and Latin \textit{ground truth} (GT) for historical OCR in the form of printed text line images paired with their transcriptions. This dataset, called \textit{GT4HistOCR}, consists of 313,173 line pairs spanning printing dates from 15th-century incunabula to 19th-century books printed in Fraktur types, and is openly available under a CC-BY 4.0 license. Because the GT takes the form of line image/transcription pairs, it can be used directly to train state-of-the-art recognition models for OCR software that employs recurrent neural networks with an LSTM architecture, such as Tesseract 4 or OCRopus. We also provide pretrained OCRopus models for subcorpora of our dataset, which yield character accuracy rates between 95\% (early printings) and 98\% (19th-century Fraktur printings) on unseen test cases, as well as a Perl script to harmonize GT produced under different transcription rules. Finally, we give hints on how to construct GT for OCR purposes, whose requirements may differ from those of linguistically motivated transcriptions.
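The character accuracy rates quoted above can be made concrete with a small sketch. The abstract does not state how the rate is computed; the version below assumes the common definition of one minus the total Levenshtein (edit) distance divided by the total number of ground-truth characters. The function names and the sample line pairs are hypothetical and are not taken from the dataset or its tooling.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Assumed definition: 1 - (sum of edit distances) / (sum of GT lengths).

    Each pair is (ground_truth_line, ocr_output_line).
    """
    errors = 0
    total = 0
    for gt, ocr in pairs:
        errors += levenshtein(gt, ocr)
        total += len(gt)
    if total == 0:
        raise ValueError("no ground-truth characters to evaluate")
    return 1.0 - errors / total


if __name__ == "__main__":
    # Hypothetical line pairs; the second has one error (long s read as f).
    sample = [("vnd", "vnd"), ("ſein", "fein")]
    print(f"character accuracy: {character_accuracy(sample):.3f}")
```

Errors are aggregated over all lines before dividing, so long lines weigh more than short ones; averaging per-line accuracies instead would be a different, equally defensible choice.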
Abstract (translated)
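In this paper, we describe a dataset of German and Latin \textit{ground truth} (GT) for historical OCR, in which printed text line images are paired with their transcriptions. The dataset, named \textit{GT4HistOCR}, comprises 313,173 line pairs covering a broad range of printing dates, from 15th-century incunabula to 19th-century books printed in Fraktur types, and is publicly available under a CC-BY 4.0 license. The particular form of the GT as line image/transcription pairs makes it directly usable for training state-of-the-art recognition models for OCR software that uses recurrent neural networks with an LSTM architecture, such as Tesseract 4 or OCRopus. We also provide several pretrained OCRopus models for subcorpora of our dataset, which achieve character accuracy rates between 95% (early printings) and 98% (19th-century Fraktur printings) on unseen test cases, along with a Perl script for harmonizing GT produced under different transcription rules, and we offer hints on how to construct GT for OCR purposes, whose requirements may differ from those of linguistically motivated transcriptions.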
URL
https://arxiv.org/abs/1809.05501