EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

Abstract

Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets. Existing OCR engines, largely designed for small-scale commercial applications in high-resource languages, often fall short of these requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets both the computational and sample efficiency requirements for liberating texts at scale by abandoning the sequence-to-sequence architecture typically used for OCR, which takes representations from a learned vision model as inputs to a learned language model. Instead, EffOCR models OCR as a character- or word-level image retrieval problem. EffOCR is cheap and sample-efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language. Models in the EffOCR model zoo can be deployed off-the-shelf with only a few lines of code. Importantly, EffOCR also allows for easy customization to new collections through a simple model training interface with minimal labeling requirements. We illustrate the utility of EffOCR by cheaply and accurately digitizing 20 million historical U.S. newspaper scans, evaluating zero-shot performance on randomly selected documents from the U.S. National Archives, and accurately digitizing Japanese documents for which all other OCR solutions failed.
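
To make the retrieval framing concrete, the following is a minimal sketch, not EffOCR's actual code or API. It assumes each character crop has already been mapped to an embedding vector by some vision encoder, and it recognizes each crop by looking up the most similar labeled reference embedding. The names `cosine`, `nearest_label`, `recognize`, and `REFERENCE_INDEX`, as well as the three-dimensional toy vectors, are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_label(embedding, index):
    """Return the label of the reference embedding most similar to `embedding`."""
    best_label, _ = max(index, key=lambda entry: cosine(embedding, entry[1]))
    return best_label


def recognize(crop_embeddings, index):
    """Recognize a line of text one character crop at a time, by retrieval only.

    No language model is involved: each crop is matched on visual
    similarity alone, independent of its neighbors.
    """
    return "".join(nearest_label(e, index) for e in crop_embeddings)


# Hypothetical labeled reference embeddings (toy values, not real model output).
REFERENCE_INDEX = [
    ("a", [1.0, 0.0, 0.0]),
    ("b", [0.0, 1.0, 0.0]),
    ("c", [0.0, 0.0, 1.0]),
]

# Hypothetical embeddings for three character crops read left to right.
crops = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.1], [0.0, 0.2, 0.7]]
print(recognize(crops, REFERENCE_INDEX))  # prints "bac"
```

Because the lookup depends only on visual similarity, nothing in this sketch has to learn how characters combine into words, which is the property the abstract credits for cheap, sample-efficient training.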

URL

https://arxiv.org/abs/2310.10050

PDF

https://arxiv.org/pdf/2310.10050.pdf
