Abstract
Linked data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images, which are not machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning on OCR applied to typewritten cultural heritage documents. The approach formulates parameter tuning as a multi-objective problem that minimizes the Levenshtein edit distance and maximizes the number of correctly identified words, and it uses a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods' parameters. Evaluation results show that tuning parameters per digital representation typology improves the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that image pre-processing might be more suitable for typologies where text recognition without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
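The two objectives named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, counting correct words as a multiset overlap with the ground truth is an assumption (the abstract does not say how words are matched), and the `pareto_front` helper shows only the non-dominance test that NSGA-II builds on, not the full genetic algorithm.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_words(ocr_text: str, truth: str) -> int:
    """Assumed metric: OCR words that also occur in the ground truth."""
    overlap = Counter(ocr_text.split()) & Counter(truth.split())
    return sum(overlap.values())


def objectives(ocr_text: str, truth: str) -> tuple:
    """Both objectives as values to minimize (word count is negated)."""
    return levenshtein(ocr_text, truth), -correct_words(ocr_text, truth)


def pareto_front(points):
    """Keep the points that no other point dominates (all minimized)."""
    def dominates(p, q):
        return (all(x <= y for x, y in zip(p, q))
                and any(x < y for x, y in zip(p, q)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

In a tuning loop, each candidate parameter set for a pre-processing method would be scored with `objectives` against a transcription of the page, and only non-dominated candidates would survive to the next generation.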
URL
https://arxiv.org/abs/2311.15740