A Cost Efficient Approach to Correct OCR Errors in Large Document Collections

2019-05-28 11:11:57
Deepayan Das, Jerin Philip, Minesh Mathew, C. V. Jawahar

Abstract

The word error rate of an OCR system is often higher than its character error rate. This is especially true when the OCR is designed to recognize individual characters. High word accuracy is critical to tasks such as creating content for digital libraries and text-to-speech applications. To detect and correct misrecognised words, an OCR module commonly employs a post-processor to further improve word accuracy. However, conventional approaches to post-processing, such as dictionary lookup or a statistical language model (SLM), are still limited. In many such scenarios, the remaining errors often have to be removed manually. We observe that traditional post-processing schemes look at error words sequentially, since OCRs process documents one at a time. We propose a cost-efficient model that addresses the error words in batches rather than correcting them individually. We exploit the fact that a collection of documents, unlike a single document, has a structure that leads to repetition of words. Such words, if efficiently grouped together and corrected as a whole, can lead to a significant reduction in cost. Correction can be fully automatic or performed with a human in the loop. To this end, we employ a novel clustering scheme to obtain fairly homogeneous clusters. We compare the performance of our model with various baseline approaches, including the case where all the errors are removed by a human. We demonstrate the efficacy of our solution empirically by reporting more than a 70% reduction in human effort with near-perfect error correction. We validate our method on books from multiple languages.
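
The batch-correction idea in the abstract can be illustrated with a small sketch. This is not the paper's clustering scheme, which the abstract does not describe. It assumes a simple greedy grouping based on string similarity from Python's standard difflib module. The function names (cluster_words, correct_in_batches), the ask_human callback, and the 0.7 threshold are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_words(words, threshold=0.7):
    """Greedy single-pass grouping (an assumption, not the paper's method).

    Each word joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new
    cluster.
    """
    clusters = []
    for word in words:
        for cluster in clusters:
            if similarity(word, cluster[0]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def correct_in_batches(error_words, ask_human, threshold=0.7):
    """Ask for one correction per cluster and apply it to every member.

    `ask_human` is a hypothetical callback that maps a cluster
    representative to its corrected form. Returns the word-to-correction
    mapping and the number of review actions that were needed.
    """
    clusters = cluster_words(error_words, threshold)
    corrections = {}
    for cluster in clusters:
        fixed = ask_human(cluster[0])  # one review action per cluster
        for word in cluster:
            corrections[word] = fixed
    return corrections, len(clusters)
```

With six erroneous word instances that fall into two groups, such a scheme would need two review actions instead of six. The saving depends entirely on how homogeneous the clusters are: a threshold that is too loose merges unrelated words and propagates wrong corrections.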

URL

https://arxiv.org/abs/1905.11739

PDF

https://arxiv.org/pdf/1905.11739.pdf

