Abstract
Handling large corpora of documents is of significant importance in many fields, none more so than crime investigation and defence, where an organisation may be presented with a large volume of scanned documents that must be processed in a finite time. The problem is exacerbated both by the volume of scanned documents and by the complexity of the pages, which often contain many different elements that each need to be processed and understood. Text recognition, a primary task in this process, usually depends on the type of text, either handwritten or machine-printed. Accordingly, recognition involves classifying the text category before deciding which recognition method to apply. This is more challenging when a document contains both handwritten and machine-printed text. In this work, we present a generic process flow for text recognition in scanned documents containing mixed handwritten and machine-printed text, without the need to classify the text in advance. We realize the proposed process flow using several open-source image processing and text recognition packages. The evaluation is performed using a specially developed variant of the IAM handwriting database, presented in this work, on which we achieve an average transcription accuracy of nearly 80% for pages containing both printed and handwritten text.
URL
https://arxiv.org/abs/1904.12387