Abstract
The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the Optical Character Recognition (OCR) process used to convert physical records to digital text is prone to errors, particularly for newspapers and periodicals, whose layouts are complex. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study asks three questions: whether LMs can perform post-OCR correction, whether that correction improves downstream NLP tasks, and whether providing socio-cultural context as part of the correction process adds value. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks such as Named Entity Recognition, where Cosine Named Entity Similarity increases. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower it. In addition to these findings, the study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and in the text requiring correction.
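To make the headline metric concrete, the sketch below computes a character error rate (CER) as Levenshtein edit distance divided by reference length, and a relative error reduction between raw and corrected OCR text. This is a minimal illustration under stated assumptions: the abstract does not give the exact formulas or tooling used in the paper, the function names are hypothetical, and the sample strings are invented for the example rather than taken from the NCSE or Overproof data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # cost of deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def error_reduction(cer_before: float, cer_after: float) -> float:
    """Relative CER reduction, as a percentage of the original error.

    Assumed definition; the paper may normalise differently.
    """
    if cer_before == 0:
        raise ValueError("cer_before must be non-zero")
    return 100.0 * (cer_before - cer_after) / cer_before


# Invented example strings (not from any dataset in the paper).
reference = "the quick brown fox"
raw_ocr = "tbe qu1ck brovvn fox"      # simulated OCR noise
corrected = "the quick brovvn fox"    # simulated partial LM correction

reduction = error_reduction(cer(reference, raw_ocr), cer(reference, corrected))
```

Under this definition, a reduction above 60% would mean the corrected text retains less than 40% of the character errors present in the raw OCR output; a negative value would mean the correction step made the text worse.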
URL
https://arxiv.org/abs/2408.17428