The widespread use of cars and other large, heavy vehicles necessitates the development of effective parking infrastructure. In addition, number plate detection and recognition algorithms are widely used to identify vehicles wherever standardized plate sizes and fonts are enforced, which makes recognition straightforward. These two needs can be combined into an intelligent parking system built on Automatic Number Plate Recognition (ANPR), whose sole purpose is to retrieve the characters from an input number plate image, typically a computationally expensive procedure. In this article, we propose Chaurah, a low-cost ANPR system based on a Raspberry Pi 3 and designed specifically for parking facilities. The system employs a dual-stage methodology. The first stage is an ANPR pipeline that uses two convolutional neural networks (CNNs): the primary network locates and recognises license plates in a vehicle image, while the secondary network performs Optical Character Recognition (OCR) to identify the individual characters on the plate. The second component is an application built with Flutter and Firebase for database administration and license plate record comparison. The application also acts as a user interface for a billing mechanism based on parking duration, resulting in an all-encompassing software deployment of the study.
https://arxiv.org/abs/2312.16894
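As a rough sketch of the dual-stage flow described above (one CNN to localise the plate, a second to classify character crops), the following Python outline chains the two stages. The model files, input resolutions, contour-based character segmentation, and character set are illustrative assumptions, not the authors' released implementation.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

plate_detector = load_model("plate_detector.h5")    # hypothetical stage-1 CNN
char_classifier = load_model("char_classifier.h5")  # hypothetical stage-2 CNN
CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"    # assumed 36-way classifier output

def read_plate(image_path: str) -> str:
    img = cv2.imread(image_path)
    inp = cv2.resize(img, (224, 224))[np.newaxis] / 255.0
    # Stage 1: assume the detector outputs a normalised (x1, y1, x2, y2) plate box.
    x1, y1, x2, y2 = plate_detector.predict(inp)[0]
    h, w = img.shape[:2]
    plate = img[int(y1 * h):int(y2 * h), int(x1 * w):int(x2 * w)]
    # Stage 2: crude contour-based character segmentation, then per-crop classification.
    gray = cv2.cvtColor(plate, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    result = []
    for cnt in sorted(contours, key=lambda c: cv2.boundingRect(c)[0]):  # left to right
        x, y, cw, ch = cv2.boundingRect(cnt)
        crop = cv2.resize(gray[y:y + ch, x:x + cw], (32, 32))
        crop = crop[np.newaxis, ..., np.newaxis] / 255.0
        result.append(CHARSET[int(np.argmax(char_classifier.predict(crop)))])
    return "".join(result)
```

In the deployed system, the recognised string would then be passed to the Flutter/Firebase application for record comparison and duration-based billing.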
Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean, narrative-style text with segments containing distinct topics. Here we consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each. In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other. In addition, the text of the announcements, which is derived from images of historical newspapers via optical character recognition, contains many typographical errors. As a result, these announcements are not amenable to segmentation with existing techniques. We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.
https://arxiv.org/abs/2312.12773
Optical character recognition (OCR) is a vital process that involves the extraction of handwritten or printed text from scanned or printed images, converting it into a format that can be understood and processed by machines. This enables further data processing activities such as searching and editing. The automatic extraction of text through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on prevailing techniques utilized throughout the OCR process, with a dedicated effort to discern the most efficacious approaches that demonstrate enhanced outcomes. To ensure a thorough evaluation, a meticulous keyword-search methodology is adopted, encompassing a comprehensive analysis of articles relevant to Arabic OCR, including both backward and forward citation reviews. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR. By highlighting these gaps, we shed light on potential areas for future exploration and development, thereby guiding researchers toward promising avenues in the field of Arabic OCR. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders involved in Arabic OCR, ultimately fostering advancements in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language.
https://arxiv.org/abs/2312.11812
In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the generated and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available.
https://arxiv.org/abs/2312.02694
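To make the prompt mechanism concrete, here is a toy PyTorch module in the same spirit: a shared patch encoder and decoder perform image-to-image prediction, and a learnable per-task vector is added to the encoder features before decoding. This is an illustrative sketch only; the dimensions, depths, and decoder design are placeholders rather than the UPOCR architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedImage2Image(nn.Module):
    """Shared ViT-style encoder-decoder with one learnable prompt per task."""
    def __init__(self, dim=256, depth=4, patch=16, num_tasks=3):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), depth)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), depth)
        self.task_prompts = nn.Parameter(torch.randn(num_tasks, dim) * 0.02)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)

    def forward(self, img, task_id):
        b, _, h, w = img.shape
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        feats = self.encoder(tokens)
        feats = feats + self.task_prompts[task_id]   # push features toward the task space
        out = self.to_pixels(self.decoder(feats))    # (B, N, 3 * patch * patch)
        gh, gw = h // self.patch, w // self.patch
        out = out.reshape(b, gh, gw, 3, self.patch, self.patch)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, h, w)

# Training uses one reconstruction objective regardless of the task, e.g.:
# loss = F.l1_loss(model(noisy_batch, task_id=0), clean_batch)   # 0 = text removal
```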
Optical Character Recognition is a technique that converts document images into searchable and editable text, making it a valuable tool for processing scanned documents. While Farsi stands as a prominent and official language in Asia, efforts to develop efficient methods for recognizing Farsi printed text have been relatively limited. This is primarily attributed to the language's distinctive features, such as its cursive form, the resemblance between certain alphabet characters, and the presence of numerous diacritics and dot placements. On the other hand, given the substantial training sample requirements of deep learning-based architectures for effective performance, the development of such datasets holds paramount significance. In light of these concerns, this paper presents a novel large-scale dataset, IDPL-PFOD2, tailored for Farsi printed text recognition. The dataset comprises 2,003,541 images featuring a wide variety of fonts, styles, and sizes, and extends the previously introduced IDPL-PFOD dataset with a substantial increase in both volume and diversity. Furthermore, the dataset's effectiveness is assessed using both CRNN-based and Vision Transformer architectures. The CRNN-based model achieves a baseline accuracy of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer architecture attains an accuracy of 81.32% and a normalized edit distance of 98.74%.
https://arxiv.org/abs/2312.01177
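For reference, the two reported quantities can be computed roughly as follows; the exact normalisation used in the paper may differ from this common formulation, so treat it as a sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def evaluate(predictions, references):
    """Sequence-level accuracy and a normalised edit-distance score in [0, 1]."""
    accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
    ned = sum(1 - levenshtein(p, r) / max(len(p), len(r), 1)
              for p, r in zip(predictions, references)) / len(references)
    return {"accuracy": accuracy, "normalized_edit_distance": ned}
```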
Recent advancements in Optical Character Recognition (OCR) have been driven by transformer-based models. OCR systems are critical in numerous high-stakes domains, yet their vulnerability to adversarial attack remains largely uncharted territory, raising concerns about security and compliance with emerging AI regulations. In this work we present a novel framework to assess the resilience of Transformer-based OCR (TrOCR) models. We develop and assess algorithms for both targeted and untargeted attacks. For the untargeted case, we measure the Character Error Rate (CER), while for the targeted case we use the success ratio. We find that TrOCR is highly vulnerable to untargeted attacks and somewhat less vulnerable to targeted attacks. On a benchmark handwriting data set, untargeted attacks can cause a CER of more than 1 without being noticeable to the eye. With a similar perturbation size, targeted attacks can lead to success rates of around $25\%$ -- here we attacked single tokens, requiring TrOCR to output the tenth most likely token from a large vocabulary.
https://arxiv.org/abs/2311.17128
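To give a feel for how such pixel-space attacks are mounted, the sketch below takes a single untargeted FGSM step against a public TrOCR checkpoint from Hugging Face. The paper's own attack algorithms are more sophisticated; the checkpoint name, image path, transcription, and epsilon here are placeholder assumptions, and for simplicity the perturbation is applied in the processor's normalised input space.

```python
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
model.eval()

image = Image.open("sample_line.png").convert("RGB")   # hypothetical handwriting crop
ground_truth = "the quick brown fox"                   # its assumed true transcription

pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values.requires_grad_(True)
labels = processor.tokenizer(ground_truth, return_tensors="pt").input_ids

# Untargeted FGSM: one gradient-sign step that increases the transcription loss.
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
epsilon = 8 / 255
adversarial = (pixel_values + epsilon * pixel_values.grad.sign()).detach()

generated = model.generate(adversarial)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```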
This paper introduces a novel approach to post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap stems from the lack of large text corpora that provide OCR errors for training language-based POC models, which are demanding in terms of corpus size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. The engine generates highly realistic handwritten text in any quantity, which we use to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on the HWR200 and School_notebooks_RU datasets, as they pose significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers when evaluating student performance, simply by comparing sentences before and after correction and displaying the differences. Our primary contribution lies in the innovative use of Bézier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, on real open corpora of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. Paper contributions can be found in this https URL
https://arxiv.org/abs/2311.15896
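The generation engine rests on ordinary Bézier geometry. A minimal sketch of sampling one cubic stroke and jittering its control points to vary the handwriting is shown below; rasterisation, glyph skeletons, and the rest of the engine are not reproduced, and the coordinates are arbitrary.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=64):
    """Sample n points along a cubic Bézier curve defined by four control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def jittered_stroke(control_points, scale=1.5, rng=np.random.default_rng(0)):
    """Perturb the control points so each rendering of the stroke looks slightly different."""
    pts = np.asarray(control_points, dtype=float) + rng.normal(0.0, scale, (4, 2))
    return cubic_bezier(pts[0], pts[1], pts[2], pts[3])

# One stroke of a glyph skeleton, re-rendered with handwriting-like variation:
stroke = jittered_stroke([(0, 0), (10, 25), (30, -5), (40, 20)])
```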
Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
https://arxiv.org/abs/2311.15740
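A hedged sketch of how the two objectives might be evaluated for one candidate parameter set is shown below, using OpenCV adaptive thresholding as the pre-processing step. The OCR callable, parameter encoding, and word-matching rule are illustrative assumptions, and NSGA-II itself would come from a library such as pymoo.

```python
import cv2
from Levenshtein import distance  # pip install Levenshtein

def objectives(params, image_bgr, ground_truth, ocr):
    """Return the two values NSGA-II minimises for one candidate configuration:
    (edit distance, negated number of correctly identified words).
    `ocr` is a placeholder callable, e.g. a Tesseract wrapper."""
    block_size, c = params
    block_size = max(3, int(block_size)) | 1          # adaptiveThreshold needs an odd size >= 3
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    binarised = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                      cv2.THRESH_BINARY, block_size, c)
    text = ocr(binarised)
    edit = distance(text, ground_truth)
    words_correct = len(set(text.split()) & set(ground_truth.split()))
    return edit, -words_correct
```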
Iris recognition systems, operating in the near infrared spectrum (NIR), have demonstrated vulnerability to presentation attacks, where an adversary uses artifacts such as cosmetic contact lenses, artificial eyes or printed iris images in order to circumvent the system. At the same time, a number of effective presentation attack detection (PAD) methods have been developed. These methods have demonstrated success in detecting artificial eyes (e.g., fake Van Dyke eyes) as presentation attacks. In this work, we seek to alter the optical characteristics of artificial eyes by affixing Vanadium Dioxide (VO2) films on their surface in various spatial configurations. VO2 films can be used to selectively transmit NIR light and can, therefore, be used to regulate the amount of NIR light from the object that is captured by the iris sensor. We study the impact of such images produced by the sensor on two state-of-the-art iris PA detection methods. We observe that the addition of VO2 films on the surface of artificial eyes can cause the PA detection methods to misclassify them as bonafide eyes in some cases. This represents a vulnerability that must be systematically analyzed and effectively addressed.
https://arxiv.org/abs/2311.12773
Existing visual parsers for molecule diagrams translate pixel-based raster images such as PNGs to chemical structure representations (e.g., SMILES). However, PDFs created by word processors including \LaTeX{} and Word provide explicit locations and shapes for characters, lines, and polygons. We introduce a method to extract symbols from born-digital PDF molecule images and then apply simple graph transformations to capture both visual and chemical structure in editable ChemDraw files (CDXML). Our fast (PDF $\rightarrow$ visual graph $\rightarrow$ chemical graph) pipeline does not require GPUs, Optical Character Recognition (OCR), or vectorization. We evaluate on standard benchmarks using SMILES strings, along with a novel evaluation that provides graph-based metrics and error compilation using LgEval. The geometric information in born-digital PDFs produces a highly accurate parser, motivating the generation of training data for visual parsers that recognize molecules from raster images, with extracted graphics, visual structure, and chemical structure as annotations. To do this we render SMILES strings in Indigo, parse the molecule structure, and then validate the recognized structure to select correct files.
https://arxiv.org/abs/2311.12161
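A hedged sketch of reading the explicit geometry out of a born-digital PDF with PyMuPDF is shown below: character glyphs with bounding boxes (atom labels) and line primitives (bonds). The file name is hypothetical, and the paper's actual symbol extraction and CDXML construction are not reproduced.

```python
import fitz  # PyMuPDF

doc = fitz.open("molecule_page.pdf")          # hypothetical input file
page = doc[0]

# Characters with their bounding boxes (atom labels, charges, ...)
chars = []
for block in page.get_text("rawdict")["blocks"]:
    for line in block.get("lines", []):
        for span in line["spans"]:
            for ch in span["chars"]:
                chars.append((ch["c"], fitz.Rect(ch["bbox"])))

# Line primitives from vector drawings (bond segments, ring edges, ...)
segments = []
for drawing in page.get_drawings():
    for item in drawing["items"]:
        if item[0] == "l":                    # straight line segment: ("l", p1, p2)
            segments.append((item[1], item[2]))
```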
The performance of optical character recognition (OCR) heavily relies on document image quality, which is crucial for automatic document processing and document intelligence. However, most existing document enhancement methods require supervised data pairs, which raises concerns about data separation and privacy protection, and makes it challenging to adapt these methods to new domain pairs. To address these issues, we propose DECDM, an end-to-end document-level image translation method inspired by recent advances in diffusion models. Our method overcomes the limitations of paired training by independently training the source (noisy input) and target (clean output) models, making it possible to apply domain-specific diffusion models to other pairs. DECDM trains on one dataset at a time, eliminating the need to scan both datasets concurrently, and effectively preserving data privacy in the source or target domain. We also introduce simple data augmentation strategies to improve character-glyph conservation during translation. We compare DECDM with state-of-the-art methods on multiple synthetic and benchmark datasets, such as document denoising and shadow removal, and demonstrate its superior performance quantitatively and qualitatively.
https://arxiv.org/abs/2311.09625
Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate ``rationales'' on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4\% absolute over a comparable Pix2Struct model that predicts answers directly.
https://arxiv.org/abs/2311.09612
This paper introduces the off-road motorcycle Racer number Dataset (RnD), a new challenging dataset for optical character recognition (OCR) research. RnD contains 2,411 images from professional motorsports photographers that depict motorcycle racers in off-road competitions. The images exhibit a wide variety of factors that make OCR difficult, including mud occlusions, motion blur, non-standard fonts, glare, complex backgrounds, etc. The dataset has 5,578 manually annotated bounding boxes around visible motorcycle numbers, along with transcribed digits and letters. Our experiments benchmark leading OCR algorithms and reveal an end-to-end F1 score of only 0.527 on RnD, even after fine-tuning. Analysis of performance on different occlusion types shows mud as the primary challenge, degrading accuracy substantially compared to normal conditions. But the models struggle with other factors including glare, blur, shadows, and dust. Analysis exposes substantial room for improvement and highlights failure cases of existing models. RnD represents a valuable new benchmark to drive innovation in real-world OCR capabilities. The authors hope the community will build upon this dataset and baseline experiments to make progress on the open problem of robustly recognizing text in unconstrained natural environments. The dataset is available at this https URL.
https://arxiv.org/abs/2311.09256
This paper introduces DONUT-hole, a sparse OCR-free visual document understanding (VDU) model that addresses the limitations of its predecessor model, dubbed DONUT. The DONUT model, leveraging a transformer architecture, overcomes the challenges of separate optical character recognition (OCR) and visual semantic understanding (VSU) components. However, its deployment in production environments and edge devices is hindered by high memory and computational demands, particularly in large-scale request services. To overcome these challenges, we propose an optimization strategy based on knowledge distillation and model pruning. Our paradigm for producing DONUT-hole reduces the model density by 54\% while preserving performance. We also measure a global representational similarity between DONUT and DONUT-hole of 0.79, based on the centered kernel alignment (CKA) metric. Moreover, we evaluate the effectiveness of DONUT-hole on the document image key information extraction (KIE) task, highlighting its potential for developing more efficient VDU systems for logistics companies.
https://arxiv.org/abs/2311.05778
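The 0.79 figure refers to centered kernel alignment between the teacher and student representations. A standard linear-CKA computation is sketched below; the exact variant and the features compared in the paper may differ.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_examples, dim_x) and (n_examples, dim_y)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    numerator = np.linalg.norm(y.T @ x, "fro") ** 2
    denominator = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(numerator / denominator)

# e.g. similarity between DONUT and DONUT-hole activations on the same batch:
# score = linear_cka(teacher_features, student_features)
```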
Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, generated images from stable diffusion models often suffer from deteriorated details. This pitfall impacts image editing tasks that require information preservation, e.g., scene text editing. As a desired result, the model must be able to replace the text in the source image with the target text while preserving details such as color, font size, and background. To leverage the potential of diffusion models, in this work we introduce the Diffusion-BasEd Scene Text manipulation network, DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-art methods on various scene text datasets, then provide extensive ablation studies at each granularity to analyze our performance gain. We also demonstrate the effectiveness of our proposed method for synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy. Our method achieves 94.15% and 98.12% on the COCO-text and ICDAR2013 datasets for character-level evaluation.
https://arxiv.org/abs/2311.00734
This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich documents. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin contents, but struggles with multilingual scenarios and complex tasks. Based on these observations, we delve deeper into the necessity of specialized OCR models and deliberate on the strategies to fully harness pretrained general LMMs like GPT-4V for OCR downstream tasks. The study offers a critical reference for future research in OCR with LMMs. Evaluation pipeline and results are available at this https URL.
https://arxiv.org/abs/2310.16809
Key information extraction (KIE) from scanned documents has gained increasing attention because of its applications in various domains. Although promising results have been achieved by some recent KIE approaches, they are usually built based on discriminative models, which lack the ability to handle optical character recognition (OCR) errors and require laborious token-level labelling. In this paper, we propose a novel generative end-to-end model, named GenKIE, to address the KIE task. GenKIE is a sequence-to-sequence multimodal generative model that utilizes multimodal encoders to embed visual, layout and textual features and a decoder to generate the desired output. Well-designed prompts are leveraged to incorporate the label semantics as the weakly supervised signals and entice the generation of the key information. One notable advantage of the generative model is that it enables automatic correction of OCR errors. Besides, token-level granular annotation is not required. Extensive experiments on multiple public real-world datasets show that GenKIE effectively generalizes over different types of documents and achieves state-of-the-art results. Our experiments also validate the model's robustness against OCR errors, making GenKIE highly applicable in real-world scenarios.
https://arxiv.org/abs/2310.16131
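Purely as an illustration of prompting with label semantics (the exact templates used by GenKIE are not specified here), a generative KIE decoder can be queried field by field:

```python
# Hypothetical field names and OCR text, for illustration only.
fields = ["company", "date", "address", "total"]
ocr_text = "ACME STORE 12/03/2021 5 HIGH ST TOTAL 15.80"

prompts = [f"{ocr_text}\nThe {field} is:" for field in fields]
# A seq2seq decoder completes each prompt, e.g. "... The total is:" -> "15.80".
# Because the output is generated rather than selected from OCR tokens,
# a misrecognised token (e.g. "l5.80") can be emitted in corrected form.
```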
For a financial analyst, the question and answer (Q\&A) segment of a company's financial report is a crucial piece of information for various analysis and investment decisions. However, extracting valuable insights from the Q\&A section has posed considerable challenges: conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human error, while Optical Character Recognition (OCR) and similar techniques struggle to accurately process unstructured transcript text, often missing the subtle linguistic nuances that drive investor decisions. Here, we demonstrate the use of Large Language Models (LLMs) to efficiently and rapidly extract information from earnings report transcripts while ensuring high accuracy, transforming the extraction process and reducing hallucination by combining a retrieval-augmented generation technique with metadata. We evaluate the outcomes of various LLMs with and without our proposed approach, based on several objective metrics for evaluating Q\&A systems, and empirically demonstrate the superiority of our method.
https://arxiv.org/abs/2310.10760
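A minimal sketch of the retrieval-augmented flow follows: transcript chunks are embedded, the chunks most similar to the analyst's question are retrieved, and a metadata-tagged prompt is assembled for the LLM. The embedding model, metadata keys, and prompt wording are assumptions, not the paper's exact pipeline.

```python
import numpy as np

def retrieve(query_vec, chunk_vecs, chunks, metadata, k=3):
    """Cosine-similarity retrieval over transcript chunks.
    Embeddings come from any text-embedding model (placeholder)."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [(chunks[i], metadata[i]) for i in top]

def build_prompt(question, retrieved):
    """Ground the LLM in retrieved excerpts and their metadata to limit hallucination."""
    context = "\n\n".join(f"[{m['speaker']}, {m['quarter']}]\n{c}" for c, m in retrieved)
    return (f"Answer using only the excerpts below; say 'not stated' otherwise.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")
```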
Billions of public domain documents remain trapped in hard copy or lack an accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets. Existing OCR engines, largely designed for small-scale commercial applications in high resource languages, often fall short of these requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets both the computational and sample efficiency requirements for liberating texts at scale by abandoning the sequence-to-sequence architecture typically used for OCR, which takes representations from a learned vision model as inputs to a learned language model. Instead, EffOCR models OCR as a character or word-level image retrieval problem. EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language. Models in the EffOCR model zoo can be deployed off-the-shelf with only a few lines of code. Importantly, EffOCR also allows for easy, sample efficient customization with a simple model training interface and minimal labeling requirements due to its sample efficiency. We illustrate the utility of EffOCR by cheaply and accurately digitizing 20 million historical U.S. newspaper scans, evaluating zero-shot performance on randomly selected documents from the U.S. National Archives, and accurately digitizing Japanese documents for which all other OCR solutions failed.
https://arxiv.org/abs/2310.10050
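The retrieval formulation can be pictured as nearest-neighbour matching of crop embeddings against a gallery of reference glyph embeddings, as in the sketch below; the embeddings would come from a small vision encoder, which is not shown, and this is not EffOCR's actual code.

```python
import numpy as np

def recognise_by_retrieval(crop_embeddings, reference_embeddings, reference_labels):
    """Assign each character/word crop the label of its nearest reference embedding
    (cosine similarity). Shapes: (n_crops, d), (n_refs, d), and n_refs labels."""
    crops = crop_embeddings / np.linalg.norm(crop_embeddings, axis=1, keepdims=True)
    refs = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    nearest = np.argmax(crops @ refs.T, axis=1)
    return [reference_labels[i] for i in nearest]
```

Because the model only has to embed glyph appearance, customizing to a new character set amounts to adding reference embeddings rather than retraining a sequence model, which is the sample-efficiency argument made above.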
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. To be specific, a large number of detected objects and optical character recognition (OCR) tokens result in rich visual relationships. Existing works take all visual relationships into account for answer prediction. However, there are three observations: (1) a single subject in the images can be easily detected as multiple objects with distinct bounding boxes (considered repetitive objects). The associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in the image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may be indicative of important visual cues for predicting answers. Rather than utilizing all of them for answer prediction, we make an effort to identify the most important connections or eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU for spatially aware pruning. We consider three visual relationships for graph learning: object-object, OCR-OCR tokens, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations in the correlated object-token sparse graph, and then in the respective object-based sparse graph and token-based sparse graph. Experiment results on TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performances. And some visualization results further demonstrate the interpretability of our method.
https://arxiv.org/abs/2310.09147
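Of the spatial factors listed, DIoU is the least familiar to a casual reader; a plain implementation for two axis-aligned boxes is given below. The pruning rule and threshold are illustrative, not the values used in SSGN.

```python
def diou(box_a, box_b):
    """Distance-IoU between two (x1, y1, x2, y2) boxes: IoU minus the squared centre
    distance normalised by the squared diagonal of the smallest enclosing box."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    iou = inter / union if union > 0 else 0.0
    centre_dist = ((xa1 + xa2 - xb1 - xb2) ** 2 + (ya1 + ya2 - yb1 - yb2) ** 2) / 4.0
    diag = (max(xa2, xb2) - min(xa1, xb1)) ** 2 + (max(ya2, yb2) - min(ya1, yb1)) ** 2
    return iou - centre_dist / diag if diag > 0 else iou

# An object-object or token-token edge might be kept only if, e.g., diou(a, b) > -0.4.
```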