Paper Reading AI Learner

CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

2024-08-30 17:26:05
Jonathan Bourne

Abstract

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction, whether they improve downstream NLP tasks, and what value providing socio-cultural context adds to the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40,000 words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
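The headline result above is stated in terms of character error rate (CER), the edit distance between the OCR output and the ground-truth transcription normalised by the reference length. A minimal sketch of that metric (the function names here are illustrative, not from the paper; production work would typically use an evaluation library rather than hand-rolled code):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed, divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# Toy OCR-style example: one substituted character, one garbled "w" -> "vv"
print(round(cer("the quick brown fox", "the qu1ck brovvn fox"), 3))  # 0.158
```

A "60% reduction in CER" then simply means the corrected text's CER is less than 40% of the raw OCR output's CER against the same reference transcription.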

Abstract (translated)

The digitisation of historical print media archives is crucial for improving the accessibility of contemporary records. However, the Optical Character Recognition (OCR) process that converts physical records to digital text is error-prone, especially for materials with complex layouts such as newspapers and periodicals. This paper introduces a method called Context Leveraging OCR Correction (CLOCR-C), which exploits the infilling and context-adaptive abilities of Transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction, whether they improve downstream natural language processing (NLP) tasks, and the value of providing socio-cultural context as part of the correction process. Experiments were conducted on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results show that some LMs can significantly reduce error rates, with the top-performing model reducing the character error rate on the NCSE dataset by over 60%. The OCR improvements extend to downstream tasks such as Named Entity Recognition, with increased Cosine Named Entity Similarity. In addition, the study finds that providing socio-cultural context in the prompts improves performance, while misleading prompts lower it. Beyond these findings, the study releases a dataset of 91 transcribed articles from the NCSE, totalling 40,000 words, to support further research in this area. These findings suggest that CLOCR-C is a promising approach for improving the quality of existing digital archives by leveraging the socio-cultural information embedded in LMs and in the text requiring correction.

URL

https://arxiv.org/abs/2408.17428

PDF

https://arxiv.org/pdf/2408.17428.pdf
