Despite significant progress in optical character recognition (OCR) and computer vision systems, robustly recognizing text and identifying people in images taken in unconstrained \emph{in-the-wild} environments remain ongoing challenges. However, such obstacles must be overcome in practical applications of vision systems, such as identifying racers in photos taken during off-road racing events. To this end, we introduce two new challenging real-world datasets, the off-road motorcycle Racer Number Dataset (RND) and the Muddy Racer re-iDentification Dataset (MUDD), to highlight the shortcomings of current methods and drive advances in OCR and person re-identification (ReID) under extreme conditions. The two datasets contain over 6,300 images taken during off-road competitions and exhibit a variety of factors that undermine even modern vision systems, namely mud, complex poses, and motion blur. We establish benchmark performance on both datasets using state-of-the-art models. Off-the-shelf models transfer poorly, reaching only a 15% end-to-end (E2E) F1 score on text spotting and 33% rank-1 accuracy on ReID. Fine-tuning yields major improvements, bringing performance to a 53% F1 score for E2E text spotting and 79% rank-1 accuracy on ReID, but still falls short of reliable performance. Our analysis exposes open problems in real-world OCR and ReID that necessitate domain-targeted techniques. With these datasets and our analysis of model limitations, we aim to foster innovation in handling real-world conditions like mud and complex poses and to drive progress in robust computer vision. All data was sourced from this http URL, a website used by professional motorsports photographers, racers, and fans. The top-performing text spotting and ReID models are deployed on this platform to power real-time race photo search.
https://arxiv.org/abs/2402.08025
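The rank-1 metric quoted above has a simple definition: a query is counted correct when its nearest gallery embedding shares its identity. A minimal NumPy sketch, assuming embeddings have already been extracted by some ReID backbone (the array names, sizes, and toy data are illustrative, not from the paper):

```python
import numpy as np

def rank1_accuracy(query_emb, query_ids, gallery_emb, gallery_ids):
    """Fraction of queries whose nearest gallery embedding shares their identity."""
    # L2-normalize so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                      # (num_query, num_gallery) similarity matrix
    top1 = sims.argmax(axis=1)          # index of the best gallery match per query
    return float(np.mean(gallery_ids[top1] == query_ids))

# Toy usage with random 128-D embeddings and integer identity labels.
rng = np.random.default_rng(0)
q_emb, g_emb = rng.normal(size=(10, 128)), rng.normal(size=(50, 128))
q_ids, g_ids = rng.integers(0, 5, 10), rng.integers(0, 5, 50)
print(f"rank-1: {rank1_accuracy(q_emb, q_ids, g_emb, g_ids):.2%}")
```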
Captchas are widely used to secure systems against automated responses by distinguishing computer responses from human responses. Text-, audio-, video-, and picture-based captchas are all in use; among these, text-based Optical Character Recognition (OCR) captchas are the most common and suffer from complex and distorted content. There have been attempts to build captcha detection and classification systems using machine learning and neural networks, but these require careful tuning for accuracy. Existing systems face challenges in recognizing distorted characters, handling variable-length captchas, and capturing sequential dependencies within a captcha. In this work, we propose a segmentation-free OCR model for text captcha classification based on the connectionist temporal classification (CTC) loss. The proposed model is trained and tested on a publicly available captcha dataset and achieves 99.80\% character-level accuracy and 95\% word-level accuracy. Compared with state-of-the-art models, the proposed model proves effective. Variable-length, complex captchas can thus be processed with the segmentation-free CTC technique, which models sequential dependencies and can be widely applied to securing software systems.
https://arxiv.org/abs/2402.05417
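A minimal sketch of the segmentation-free CTC formulation the paper builds on, using PyTorch's built-in `nn.CTCLoss`; the tiny convolutional-recurrent network and alphabet below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # index 0 is reserved for the CTC blank
NUM_CLASSES = len(ALPHABET) + 1

class TinyCaptchaNet(nn.Module):
    """Conv features -> BiLSTM -> per-timestep class logits; no character segmentation."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(64 * 16, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, NUM_CLASSES)

    def forward(self, x):                           # x: (N, 1, 64, W)
        f = self.conv(x)                            # (N, 64, 16, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)        # (N, W/4, 1024): width acts as time
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)         # (N, T, C)

model = TinyCaptchaNet()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(4, 1, 64, 128)                 # dummy batch of captcha images
log_probs = model(images).permute(1, 0, 2)          # CTCLoss expects (T, N, C)
targets = torch.randint(1, NUM_CLASSES, (4, 5))     # 5-character labels in 1..C-1
input_lens = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((4,), 5, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()   # CTC marginalizes over alignments, so variable lengths come free
```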
In this work, product tables in invoices are extracted autonomously via a deep learning model named ExTTNet. First, text is obtained from invoice images using Optical Character Recognition (OCR) techniques; the Tesseract OCR engine [37] is used for this process. Afterwards, feature extraction methods are applied to increase the number of features and thereby the accuracy. Each text item obtained from OCR is labeled according to whether or not it is a table element. In this study, a multilayer artificial neural network model is used. Training was carried out on an Nvidia RTX 3090 graphics card and took $162$ minutes. As a result of the training, the F1 score is $0.92$.
https://arxiv.org/abs/2402.02246
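A hedged sketch of the pipeline as the abstract describes it: Tesseract OCR, per-token feature extraction, and a multilayer neural network labeling each token as a table element or not. The specific features and classifier size below are illustrative assumptions, not ExTTNet's actual design:

```python
import pytesseract
from pytesseract import Output
from PIL import Image
from sklearn.neural_network import MLPClassifier

def token_features(image_path):
    """Run Tesseract and build one feature vector per recognized word."""
    img = Image.open(image_path)
    d = pytesseract.image_to_data(img, output_type=Output.DICT)
    feats, texts = [], []
    for i, word in enumerate(d["text"]):
        if not word.strip():
            continue
        # Position, size, and simple text statistics as features (illustrative).
        feats.append([
            d["left"][i] / img.width, d["top"][i] / img.height,
            d["width"][i] / img.width, d["height"][i] / img.height,
            sum(c.isdigit() for c in word) / len(word),  # digit ratio
            float("." in word or "," in word),           # price-like punctuation
        ])
        texts.append(word)
    return feats, texts

# Supervised training: labels mark whether each token is a table element.
# X_train, y_train would come from manually labeled invoices.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
# clf.fit(X_train, y_train)
# is_table = clf.predict(token_features("invoice.png")[0])
```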
Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, challenges remain in accurately interpreting detailed visual elements. This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and Optical Character Recognition (OCR) models to improve fine-grained image understanding and reduce hallucination in responses. Our research investigates the embedding-based infusion of detection information, the impact of such infusion on the MLLMs' original abilities, and the interchangeability of detection models. We conduct systematic experiments with models such as LLaVA-1.5, DINO, and PaddleOCRv2, revealing that our approach not only refines MLLMs' performance in specific visual tasks but also maintains their original strengths. The resulting enhanced MLLMs outperform SOTA models on 9 out of 10 benchmarks, achieving an improvement of up to 12.99% on the normalized average score, marking a notable advancement in multimodal understanding. We release our code to facilitate further exploration into the fine-grained multimodal dialogue capabilities of MLLMs.
https://arxiv.org/abs/2401.17981
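The paper infuses detection information at the embedding level; a much simpler text-level variant conveys the idea: serialize detector and OCR outputs into the prompt so the language model can ground its answers. The format and toy data below are illustrative assumptions, not the paper's method:

```python
def serialize_detections(detections, ocr_results):
    """Render detector and OCR outputs as text lines an MLLM can condition on."""
    lines = ["Detected objects:"]
    for label, score, (x1, y1, x2, y2) in detections:
        lines.append(f"- {label} (conf {score:.2f}) at [{x1},{y1},{x2},{y2}]")
    lines.append("Recognized text:")
    for text, (x1, y1, x2, y2) in ocr_results:
        lines.append(f"- '{text}' at [{x1},{y1},{x2},{y2}]")
    return "\n".join(lines)

# Toy outputs in the style of a DINO detector and an OCR engine.
dets = [("person", 0.97, (12, 40, 180, 420)), ("bicycle", 0.88, (150, 200, 400, 430))]
ocr = [("STOP", (300, 50, 360, 110))]
prompt = serialize_detections(dets, ocr) + "\n\nQuestion: What does the sign say?"
print(prompt)
```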
Recent advancements in deep neural networks have markedly enhanced the performance of computer vision tasks, yet the specialized nature of these networks often necessitates extensive data and high computational power. Addressing these requirements, this study presents a novel neural network model adept at optical character recognition (OCR) across diverse domains, leveraging the strengths of multi-task learning to improve efficiency and generalization. The model is designed to achieve rapid adaptation to new domains, maintain a compact size conducive to reduced computational resource demand, ensure high accuracy, retain knowledge from previous learning experiences, and allow for domain-specific performance improvements without the need to retrain entirely. Rigorous evaluation on open datasets has validated the model's ability to significantly lower the number of trainable parameters without sacrificing performance, indicating its potential as a scalable and adaptable solution in the field of computer vision, particularly for applications in optical text recognition.
https://arxiv.org/abs/2401.00971
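One standard way to get the properties listed above (shared capacity, cheap adaptation to new domains, no full retraining) is a shared encoder with small per-domain heads, where adding a domain trains only a new head. A PyTorch sketch under that assumption; the paper's actual architecture may differ:

```python
import torch
import torch.nn as nn

class MultiDomainOCR(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.feat_dim = feat_dim
        # Shared, compact feature extractor reused across all domains.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 32)), nn.Flatten(),
            nn.Linear(32 * 32, feat_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleDict()   # one lightweight head per domain

    def add_domain(self, name, num_classes):
        self.heads[name] = nn.Linear(self.feat_dim, num_classes)

    def forward(self, x, domain):
        return self.heads[domain](self.encoder(x))

model = MultiDomainOCR()
model.add_domain("license_plates", num_classes=36)
model.add_domain("receipts", num_classes=80)
# Adapting to a new domain: freeze the encoder, train only the new head,
# so earlier domains' behavior (their heads) is retained untouched.
for p in model.encoder.parameters():
    p.requires_grad = False
model.add_domain("handwriting", num_classes=100)
logits = model(torch.randn(2, 1, 32, 128), "handwriting")
```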
The scaling laws relating model size, data volume, and computation to model performance have been extensively studied in the field of Natural Language Processing (NLP). However, the scaling laws in Optical Character Recognition (OCR) have not yet been investigated. To address this, we conducted comprehensive studies examining the correlation between performance and the scale of models, data volume, and computation in the field of text recognition. The study conclusively demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we have constructed a large-scale dataset called REBU-Syn, which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling laws and new dataset, we have successfully trained a scene text recognition model, achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%.
https://arxiv.org/abs/2401.00028
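A smooth power law $E = a \cdot N^{b}$ is linear in log-log space, so it can be fitted by least squares. A NumPy sketch with made-up (model size, error) points, not the paper's measurements:

```python
import numpy as np

# Hypothetical (model size, error rate) pairs; a power law is a line in log-log space.
sizes = np.array([1e6, 5e6, 2e7, 1e8, 5e8])
errors = np.array([0.210, 0.145, 0.098, 0.064, 0.043])

b, log_a = np.polyfit(np.log(sizes), np.log(errors), deg=1)
a = np.exp(log_a)
print(f"fitted law: error ~= {a:.3g} * N^{b:.3f}")

# Extrapolate to a larger model under the fitted law.
print(f"predicted error at N=2e9: {a * (2e9) ** b:.4f}")
```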
The widespread usage of cars and other large, heavy vehicles necessitates the development of an effective parking infrastructure. Additionally, algorithms for the detection and recognition of number plates are widely used to identify automobiles around the world where standardized plate sizes and fonts are enforced, making recognition an effortless task. As a result, both kinds of data can be combined to develop an intelligent parking system that focuses on the technology of Automatic Number Plate Recognition (ANPR). The sole purpose of ANPR is to retrieve the characters from an input number plate image, which is a costly procedure. In this article, we propose Chaurah, a minimal-cost ANPR system built on a Raspberry Pi 3 and created specifically for parking facilities. The system employs a dual-stage methodology, the first stage being an ANPR system that makes use of two convolutional neural networks (CNNs). The primary network locates and recognises license plates in a vehicle image, while the secondary network performs Optical Character Recognition (OCR) to identify the individual characters on the number plate. An application built with Flutter and Firebase for database administration and license plate record comparison makes up the second component of the overall solution. The application also acts as a user interface for a billing mechanism based on parking duration, resulting in an all-encompassing software deployment of the study.
https://arxiv.org/abs/2312.16894
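A structural sketch of the dual-stage pipeline described above: locate the plate, then OCR the crop. The Haar-cascade detector and Tesseract below are stand-ins for the paper's two CNNs; only the control flow mirrors the system:

```python
import cv2
import pytesseract

def detect_plate(frame, plate_cascade):
    """Stage 1 stand-in: locate a license plate region (here via a Haar cascade)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = plate_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
    return None if len(boxes) == 0 else max(boxes, key=lambda b: b[2] * b[3])

def read_plate(frame, box):
    """Stage 2 stand-in: OCR the cropped, binarized plate region."""
    x, y, w, h = box
    crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    _, crop = cv2.threshold(crop, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(crop, config="--psm 7").strip()  # one text line

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_russian_plate_number.xml")
frame = cv2.imread("car.jpg")                      # placeholder input image
box = detect_plate(frame, cascade)
if box is not None:
    print("plate:", read_plate(frame, box))        # then compared against DB records
```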
Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean, narrative-style text with segments containing distinct topics. Here we consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each. In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other. In addition, the text of the announcements, which is derived from images of historical newspapers via optical character recognition, contains many typographical errors. As a result, these announcements are not amenable to segmentation with existing techniques. We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.
https://arxiv.org/abs/2312.12773
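One simple baseline formulation of this task is per-line boundary classification: predict whether each OCR'd line starts a new announcement from features of the line and its predecessor. This framing and the features below are a hedged illustration, not the paper's model:

```python
from sklearn.linear_model import LogisticRegression

def boundary_features(prev_line, line):
    """Features hinting that `line` begins a new announcement (illustrative)."""
    return [
        float(line[:1].isupper()),                 # announcements often open with a name
        float(prev_line.rstrip().endswith(".")),   # previous announcement just closed
        len(line.split()),                         # very short lines are often continuations
        sum(c.isdigit() for c in line),            # dates/ages appear near openings
    ]

# X, y built from lines of labeled announcement lists (y=1 at announcement starts).
lines = ["Smith-Jones. Mary Smith, 24,", "wed John Jones on June 3.", "Brown-Davis. Ann Brown,"]
X = [boundary_features(lines[i - 1] if i else "", l) for i, l in enumerate(lines)]
y = [1, 0, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```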
Optical character recognition (OCR) is a vital process that involves the extraction of handwritten or printed text from scanned or printed images, converting it into a format that can be understood and processed by machines. This enables further data processing activities such as searching and editing. The automatic extraction of text through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on prevailing techniques utilized throughout the OCR process, with a dedicated effort to discern the most efficacious approaches that demonstrate enhanced outcomes. To ensure a thorough evaluation, a meticulous keyword-search methodology is adopted, encompassing a comprehensive analysis of articles relevant to Arabic OCR, including both backward and forward citation reviews. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR. By highlighting these gaps, we shed light on potential areas for future exploration and development, thereby guiding researchers toward promising avenues in the field of Arabic OCR. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders involved in Arabic OCR, ultimately fostering advancements in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language.
https://arxiv.org/abs/2312.11812
In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the generated and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available.
https://arxiv.org/abs/2312.02694
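The core mechanism described above, learnable task prompts steering one shared encoder-decoder, can be sketched as per-task prompt vectors added to the encoder features before decoding. The dimensions and the additive-prompt choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

TASKS = ["text_removal", "text_segmentation", "tamper_detection"]

class PromptedEncoderDecoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in decoder
        # One learnable prompt vector per task, broadcast over all patch tokens.
        self.prompts = nn.ParameterDict(
            {t: nn.Parameter(torch.zeros(1, 1, dim)) for t in TASKS})
        self.head = nn.Linear(dim, 3)   # predict RGB per patch (image-to-image)

    def forward(self, patch_tokens, task):
        feats = self.encoder(patch_tokens)
        feats = feats + self.prompts[task]   # push features toward the task's space
        return self.head(self.decoder(feats))

model = PromptedEncoderDecoder()
tokens = torch.randn(2, 196, 256)            # e.g. 14x14 patch embeddings
out = model(tokens, "text_removal")          # same weights; the prompt selects the task
```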
Optical Character Recognition is a technique that converts document images into searchable and editable text, making it a valuable tool for processing scanned documents. While Farsi stands as a prominent official language in Asia, efforts to develop efficient methods for recognizing Farsi printed text have been relatively limited. This is primarily attributed to the language's distinctive features, such as its cursive form, the resemblance between certain alphabet characters, and the presence of numerous diacritics and dot placements. On the other hand, given the substantial training-sample requirements of deep architectures for effective performance, the development of such datasets holds paramount significance. In light of these concerns, this paper presents a novel large-scale dataset, IDPL-PFOD2, tailored for Farsi printed text recognition. The dataset comprises 2,003,541 images featuring a wide variety of fonts, styles, and sizes. It is an extension of the previously introduced IDPL-PFOD dataset, offering a substantial increase in both volume and diversity. Furthermore, the dataset's effectiveness is assessed using both a CRNN-based architecture and a Vision Transformer architecture. The CRNN-based model achieves a baseline accuracy of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer architecture attains an accuracy of 81.32% and a normalized edit distance of 98.74%.
https://arxiv.org/abs/2312.01177
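The normalized edit distance quoted above (as a percentage, higher is better) is commonly computed as one minus the Levenshtein distance over the longer string's length. Assuming that common definition, a plain-Python sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_similarity(pred: str, target: str) -> float:
    if not pred and not target:
        return 1.0
    return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))

print(f"{normalized_edit_similarity('کتاب', 'کتّاب'):.4f}")  # one edit -> 0.8000
```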
Recent advancements in Optical Character Recognition (OCR) have been driven by transformer-based models. OCR systems are critical in numerous high-stakes domains, yet their vulnerability to adversarial attack remains largely uncharted territory, raising concerns about security and compliance with emerging AI regulations. In this work we present a novel framework to assess the resilience of Transformer-based OCR (TrOCR) models. We develop and assess algorithms for both targeted and untargeted attacks. For the untargeted case, we measure the Character Error Rate (CER), while for the targeted case we use the success ratio. We find that TrOCR is highly vulnerable to untargeted attacks and somewhat less vulnerable to targeted attacks. On a benchmark handwriting data set, untargeted attacks can cause a CER of more than 1 without being noticeable to the eye. With a similar perturbation size, targeted attacks can lead to success rates of around $25\%$ -- here we attacked single tokens, requiring TrOCR to output the tenth most likely token from a large vocabulary.
https://arxiv.org/abs/2311.17128
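An untargeted attack of the kind studied above can be sketched with a single FGSM step against TrOCR's training loss: perturb the pixels in the direction that increases the loss of the correct transcription. This generic FGSM illustration is likely weaker than the paper's algorithms; the checkpoint names are the public Hugging Face ones and the input file is a placeholder:

```python
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
model.eval()

image = Image.open("line.png").convert("RGB")    # a handwritten text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
labels = processor.tokenizer("ground truth text", return_tensors="pt").input_ids

# FGSM: one signed-gradient step on the input that increases the loss.
pixel_values.requires_grad_(True)
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
adv = (pixel_values + 0.02 * pixel_values.grad.sign()).detach()  # small epsilon

# Compare transcriptions before and after the perturbation.
for name, px in [("clean", pixel_values.detach()), ("adversarial", adv)]:
    ids = model.generate(px)
    print(name, processor.batch_decode(ids, skip_special_tokens=True)[0])
```

The Character Error Rate used for the untargeted evaluation is edit distance divided by reference length, which is why it can exceed 1 when the model hallucinates extra characters.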
This paper introduces a novel approach to post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap is due to the lack of large text corpora that provide OCR errors for further training of language-based POC models, which are demanding in terms of corpus size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. Such an engine generates highly realistic handwritten text in any quantity, which we utilize to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on the HWR200 and School_notebooks_RU datasets, as they pose significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers evaluating student performance, simply by comparing sentences before and after correction and displaying the differences in the text. Our primary contribution lies in the innovative use of Bézier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, using real open corpora of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. Paper contributions can be found in this https URL
https://arxiv.org/abs/2311.15896
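The generation engine rests on Bézier curves; the core primitive is the cubic Bézier point $B(t) = (1-t)^3 P_0 + 3(1-t)^2 t\, P_1 + 3(1-t) t^2 P_2 + t^3 P_3$. A NumPy sketch that samples one jittered stroke (the control points are made up; a real engine chains many such strokes per glyph):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=64):
    """Sample n points along a cubic Bezier stroke defined by 4 control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# A single made-up stroke; jittering control points yields handwriting variation.
p0, p1, p2, p3 = map(np.array, [(0.0, 0.0), (0.3, 1.2), (0.7, -0.8), (1.0, 0.2)])
jitter = lambda p: p + np.random.normal(scale=0.05, size=2)
stroke = cubic_bezier(jitter(p0), jitter(p1), jitter(p2), jitter(p3))
print(stroke.shape)   # (64, 2) x,y samples ready to rasterize
```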
Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize the Levenshtein edit distance and maximize the number of correctly identified words, with a non-dominated sorting genetic algorithm (NSGA-II) tuning the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be most suitable for typologies where the text recognition task does not produce good results without pre-processing. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, the letters, and the overall dataset, respectively, and should be applied before OCR to improve its performance.
https://arxiv.org/abs/2311.15740
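The tuning loop scores each candidate parameter set on the two objectives named above. A sketch of that evaluation using OpenCV's Bilateral Filter and Adaptive Thresholding, with Tesseract standing in for whichever OCR engine the paper used; NSGA-II (e.g., from a library such as pymoo) would then evolve the parameter tuples toward the Pareto front:

```python
import cv2
import pytesseract
from Levenshtein import distance as levenshtein   # `Levenshtein` package

def preprocess(gray, block_size, c, d, sigma_color, sigma_space):
    """Candidate pipeline: Bilateral Filter then Adaptive Thresholding."""
    smoothed = cv2.bilateralFilter(gray, d, sigma_color, sigma_space)
    return cv2.adaptiveThreshold(smoothed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, block_size, c)

def objectives(gray, ground_truth, params):
    """The two NSGA-II objectives: (edit distance, -correct words); both minimized."""
    text = pytesseract.image_to_string(preprocess(gray, *params))
    edit = levenshtein(text, ground_truth)
    words_ok = sum(w in text.split() for w in ground_truth.split())
    return edit, -words_ok

gray = cv2.imread("letter_scan.png", cv2.IMREAD_GRAYSCALE)   # placeholder scan
# Candidate: block_size=31, C=10, bilateral d=9, sigmaColor=sigmaSpace=75.
print(objectives(gray, "Dear Sir, thank you for your letter.", (31, 10, 9, 75, 75)))
```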
Iris recognition systems, operating in the near infrared spectrum (NIR), have demonstrated vulnerability to presentation attacks, where an adversary uses artifacts such as cosmetic contact lenses, artificial eyes or printed iris images in order to circumvent the system. At the same time, a number of effective presentation attack detection (PAD) methods have been developed. These methods have demonstrated success in detecting artificial eyes (e.g., fake Van Dyke eyes) as presentation attacks. In this work, we seek to alter the optical characteristics of artificial eyes by affixing Vanadium Dioxide (VO2) films on their surface in various spatial configurations. VO2 films can be used to selectively transmit NIR light and can, therefore, be used to regulate the amount of NIR light from the object that is captured by the iris sensor. We study the impact of such images produced by the sensor on two state-of-the-art iris PA detection methods. We observe that the addition of VO2 films on the surface of artificial eyes can cause the PA detection methods to misclassify them as bonafide eyes in some cases. This represents a vulnerability that must be systematically analyzed and effectively addressed.
https://arxiv.org/abs/2311.12773
Existing visual parsers for molecule diagrams translate pixel-based raster images such as PNGs into chemical structure representations (e.g., SMILES). However, PDFs created by word processors, including \LaTeX{} and Word, provide explicit locations and shapes for characters, lines, and polygons. We introduce a method to extract symbols from born-digital PDF molecule images and then apply simple graph transformations to capture both visual and chemical structure in editable ChemDraw files (CDXML). Our fast (PDF $\rightarrow$ visual graph $\rightarrow$ chemical graph) pipeline requires no GPUs, Optical Character Recognition (OCR), or vectorization. We evaluate on standard benchmarks using SMILES strings, along with a novel evaluation that provides graph-based metrics and error compilation using LgEval. The geometric information in born-digital PDFs produces a highly accurate parser, motivating the generation of training data for visual parsers that recognize raster images, with extracted graphics, visual structure, and chemical structure as annotations. To do this, we render SMILES strings in Indigo, parse the molecule structure, and then validate the recognized structure to select correct files.
https://arxiv.org/abs/2311.12161
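The first stage, reading explicit character and line primitives out of a born-digital PDF, can be sketched with pdfminer.six (a stand-in; the paper does not say which PDF reader it uses). The later graph transformations are not shown:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTLine, LTTextContainer

def pdf_primitives(path):
    """Collect (character, bbox) and line-segment primitives from a born-digital PDF."""
    chars, lines = [], []
    for page in extract_pages(path):
        for element in page:
            if isinstance(element, LTLine):            # bond candidates
                lines.append(element.bbox)             # (x0, y0, x1, y1)
            elif isinstance(element, LTTextContainer):
                for text_line in element:
                    for obj in text_line:
                        if isinstance(obj, LTChar):    # atom-label candidates
                            chars.append((obj.get_text(), obj.bbox))
    return chars, lines

chars, lines = pdf_primitives("molecule.pdf")          # placeholder input file
print(f"{len(chars)} characters, {len(lines)} line segments")
# From here, geometric grouping builds the visual graph, then the chemical graph.
```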
The performance of optical character recognition (OCR) heavily relies on document image quality, which is crucial for automatic document processing and document intelligence. However, most existing document enhancement methods require supervised data pairs, which raises concerns about data separation and privacy protection, and makes it challenging to adapt these methods to new domain pairs. To address these issues, we propose DECDM, an end-to-end document-level image translation method inspired by recent advances in diffusion models. Our method overcomes the limitations of paired training by independently training the source (noisy input) and target (clean output) models, making it possible to apply domain-specific diffusion models to other pairs. DECDM trains on one dataset at a time, eliminating the need to scan both datasets concurrently and effectively preserving data privacy for the source and target domains. We also introduce simple data augmentation strategies to improve character-glyph conservation during translation. We compare DECDM with state-of-the-art methods on multiple synthetic datasets and benchmark tasks, such as document denoising and shadow removal, and demonstrate its superior performance quantitatively and qualitatively.
https://arxiv.org/abs/2311.09625
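The unpaired setup can be illustrated with an SDEdit/DDIB-style bridge between independently trained diffusion models: partially noise a source-domain image, then denoise it with the target-domain model. This is a hedged reading of the approach using the diffusers library, not the paper's exact algorithm, and the model configuration below is arbitrary:

```python
import torch
from diffusers import DDPMScheduler, UNet2DModel

# Two UNets would be trained independently, one per domain, never sharing data;
# here the target-domain model is freshly initialized just to show the mechanics.
target_model = UNet2DModel(sample_size=64, in_channels=1, out_channels=1)
scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(1000)

@torch.no_grad()
def translate(x_src, t_mid=500):
    """Noise a source-domain image to step t_mid, then denoise it with the
    target-domain model, pulling it toward the clean-document distribution."""
    noise = torch.randn_like(x_src)
    x_t = scheduler.add_noise(x_src, noise, torch.tensor([t_mid]))  # forward process
    for t in scheduler.timesteps[scheduler.timesteps <= t_mid]:
        eps = target_model(x_t, t).sample               # predicted noise, target domain
        x_t = scheduler.step(eps, t, x_t).prev_sample   # reverse-process step
    return x_t

clean_like = translate(torch.randn(1, 1, 64, 64))   # noisy doc in, clean-style doc out
```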
Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate ``rationales'' on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4\% absolute over a comparable Pix2Struct model that predicts answers directly.
https://arxiv.org/abs/2311.09612
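The recipe above amounts to rewriting training targets so the student decodes an intermediate rationale before its answer. A sketch of that target construction; the separator tokens and field names are assumptions:

```python
def build_target(ocr_text, lm_rationale, answer):
    """Concatenate intermediate rationales with the answer as the student's target."""
    rationale = f"text: {ocr_text} reasoning: {lm_rationale}"
    return f"[rationale] {rationale} [answer] {answer}"

# One training example for an image-to-text student such as Pix2Struct:
#   input  = (document image pixels, "What is the invoice total?")
#   target = build_target("Invoice total: $1,240.00 due 2023-06-01",
#                         "the total field reads $1,240.00", "$1,240.00")
# At inference the student decodes rationale and answer in one pass, and the
# string after [answer] is taken as the prediction.
print(build_target("Total: $5", "the total field reads $5", "$5"))
```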
This paper introduces the off-road motorcycle Racer number Dataset (RnD), a new challenging dataset for optical character recognition (OCR) research. RnD contains 2,411 images from professional motorsports photographers that depict motorcycle racers in off-road competitions. The images exhibit a wide variety of factors that make OCR difficult, including mud occlusions, motion blur, non-standard fonts, glare, complex backgrounds, etc. The dataset has 5,578 manually annotated bounding boxes around visible motorcycle numbers, along with transcribed digits and letters. Our experiments benchmark leading OCR algorithms and reveal an end-to-end F1 score of only 0.527 on RnD, even after fine-tuning. Analysis of performance on different occlusion types shows mud as the primary challenge, degrading accuracy substantially compared to normal conditions. But the models struggle with other factors including glare, blur, shadows, and dust. Analysis exposes substantial room for improvement and highlights failure cases of existing models. RnD represents a valuable new benchmark to drive innovation in real-world OCR capabilities. The authors hope the community will build upon this dataset and baseline experiments to make progress on the open problem of robustly recognizing text in unconstrained natural environments. The dataset is available at this https URL.
https://arxiv.org/abs/2311.09256
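The end-to-end F1 reported above counts a prediction as correct only when its box matches a ground-truth box and the transcription is exact. A sketch assuming the common protocol (IoU ≥ 0.5, case-insensitive exact match) rather than the paper's exact evaluation scripts:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def e2e_f1(preds, gts, iou_thr=0.5):
    """preds/gts: lists of (box, text). Greedy one-to-one matching."""
    used, tp = set(), 0
    for pbox, ptext in preds:
        for j, (gbox, gtext) in enumerate(gts):
            if j not in used and iou(pbox, gbox) >= iou_thr \
                    and ptext.lower() == gtext.lower():
                used.add(j); tp += 1
                break
    prec = tp / max(len(preds), 1)
    rec = tp / max(len(gts), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)

gts = [((10, 10, 60, 40), "27"), ((100, 15, 150, 45), "114")]
preds = [((12, 11, 59, 41), "27"), ((101, 14, 149, 44), "14")]  # mud hides the '1'
print(f"E2E F1: {e2e_f1(preds, gts):.3f}")   # one hit, one miss -> 0.500
```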
This paper introduces DONUT-hole, a sparse OCR-free visual document understanding (VDU) model that addresses the limitations of its predecessor, DONUT. The DONUT model leverages a transformer architecture to overcome the challenges of separate optical character recognition (OCR) and visual semantic understanding (VSU) components. However, its deployment in production environments and on edge devices is hindered by high memory and computational demands, particularly in large-scale request services. To overcome these challenges, we propose an optimization strategy based on knowledge distillation and model pruning. Our paradigm for producing DONUT-hole reduces the model density by 54\% while preserving performance. We also achieve a global representational similarity index of 0.79 between DONUT and DONUT-hole, based on the centered kernel alignment (CKA) metric. Moreover, we evaluate the effectiveness of DONUT-hole on the document image key information extraction (KIE) task, highlighting its potential for developing more efficient VDU systems for logistics companies.
https://arxiv.org/abs/2311.05778
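The 0.79 similarity above is a centered kernel alignment score; linear CKA has a closed form over centered feature matrices. A NumPy sketch of that standard formula with synthetic activations (the real score requires both models' features on the same inputs):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2),
    where rows are the same n examples passed through two models."""
    X = X - X.mean(axis=0)          # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(512, 768))                        # e.g. DONUT activations
student_feats = teacher_feats @ rng.normal(size=(768, 384)) * 0.1  # pruned-student stand-in
print(f"CKA: {linear_cka(teacher_feats, student_feats):.3f}")      # high = similar reps
```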