Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods rely heavily on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of the OCR modality to enhance fault-tolerant representations of OCR texts, thereby reducing noise caused by OCR errors. Simultaneously, we add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Extensive experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.
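A minimal sketch of the embedding-space adversarial idea behind AOE, assuming a PyTorch model that exposes a VQA loss conditioned on OCR embeddings; `model.vqa_loss` and `batch` are hypothetical placeholders, not the paper's interface:

```python
# Sketch only: one FGSM-style adversarial step on OCR token embeddings,
# in the spirit of the AOE module. `model.vqa_loss` is an assumed interface.
import torch

def adversarial_ocr_step(model, ocr_embeds, batch, epsilon=1e-2):
    """Perturb OCR embeddings along the gradient of the task loss."""
    ocr_embeds = ocr_embeds.detach().requires_grad_(True)
    clean_loss = model.vqa_loss(batch, ocr_embeds=ocr_embeds)
    grad = torch.autograd.grad(clean_loss, ocr_embeds)[0]
    # A small signed perturbation keeps the noisy OCR tokens plausible.
    adv_embeds = (ocr_embeds + epsilon * grad.sign()).detach()
    adv_loss = model.vqa_loss(batch, ocr_embeds=adv_embeds)
    return clean_loss + adv_loss   # train on clean + adversarial views
```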
https://arxiv.org/abs/2403.09288
Chinese Spell Checking (CSC) is a widely used technology that plays a vital role in speech-to-text (STT) and optical character recognition (OCR). Most existing CSC approaches rely on the BERT architecture and achieve excellent performance. However, limited by the scale of the foundation model, BERT-based methods do not work well in few-shot scenarios, showing certain limitations in practical applications. In this paper, we explore an in-context learning method named RS-LLM (Rich Semantic based LLMs) that introduces large language models (LLMs) as the foundation model. Besides, we study the impact of introducing various kinds of Chinese rich semantic information into our framework. We find that by introducing a small number of specific Chinese rich semantic structures, LLMs achieve better performance than the BERT-based model on the few-shot CSC task. Furthermore, we conduct experiments on multiple datasets, and the experimental results verify the superiority of our proposed framework.
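For illustration, a toy version of a few-shot CSC prompt that injects one kind of rich semantic cue (pinyin of the suspect character); the exact semantic structures used by RS-LLM are not reproduced here:

```python
# Illustrative only: building a few-shot CSC prompt with a simple
# "rich semantic" hint. The prompt wording is an assumption.
def build_csc_prompt(examples, sentence, semantic_notes=""):
    lines = ["请纠正下面句子中的错别字,只输出纠正后的句子。"]
    for wrong, right in examples:              # few-shot demonstrations
        lines.append(f"输入: {wrong}\n输出: {right}")
    if semantic_notes:                          # e.g. pinyin / radical hints
        lines.append(f"提示: {semantic_notes}")
    lines.append(f"输入: {sentence}\n输出:")
    return "\n".join(lines)

prompt = build_csc_prompt(
    examples=[("我想去公圆散步。", "我想去公园散步。")],
    sentence="他的态度非常任真。",
    semantic_notes="“任”与“认”同音(rèn)。",
)
```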
https://arxiv.org/abs/2403.08492
The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision-language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
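To make the mixing step concrete, here is a toy Mixture-of-Experts gate over the three feature types; dimensions and the gating design are assumptions rather than MoAI-Mixer's actual architecture:

```python
# Toy MoE gate over (1) visual, (2) auxiliary CV, and (3) language features.
import torch
import torch.nn as nn

class ToyIntelligenceMixer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, visual, auxiliary, language):
        feats = [visual, auxiliary, language]            # each: (batch, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        expert_outs = torch.stack(
            [expert(f) for expert, f in zip(self.experts, feats)], dim=1
        )                                                # (batch, 3, dim)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)

mixer = ToyIntelligenceMixer()
out = mixer(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
```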
https://arxiv.org/abs/2403.07508
Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often grapple with significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD) documents. To tackle this issue, we propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols. LOCR adeptly handles various formatting elements and generates content in Markdown language. It outperforms all existing methods on our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure. LOCR also reduces repetition frequency from 4.4% of pages to 0.5% in the arXiv dataset, from 13.2% to 1.3% in OOD quantum physics documents, and from 8.1% to 1.8% in OOD marketing documents. Additionally, LOCR features an interactive OCR mode, facilitating the generation of complex documents through a few location prompts from humans.
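One simple way to realize location guiding, shown below as a hedged sketch (not necessarily LOCR's mechanism), is to embed each token's bounding box and add it to the token embedding before decoding:

```python
# Sketch: location-aware token embeddings for an autoregressive decoder.
import torch
import torch.nn as nn

class LocationAwareEmbedding(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.box = nn.Linear(4, dim)   # (x1, y1, x2, y2), normalized to [0, 1]

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq), boxes: (batch, seq, 4)
        return self.tok(token_ids) + self.box(boxes)

emb = LocationAwareEmbedding(vocab_size=50_000, dim=256)
x = emb(torch.randint(0, 50_000, (2, 8)), torch.rand(2, 8, 4))
```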
https://arxiv.org/abs/2403.02127
Language Models (LMs) such as BERT, have been shown to perform well on the task of identifying Named Entities (NE) in text. A BERT LM is typically used as a classifier to classify individual tokens in the input text, or to classify spans of tokens, as belonging to one of a set of possible NE categories. In this paper, we hypothesise that decoder-only Large Language Models (LLMs) can also be used generatively to extract both the NE, as well as potentially recover the correct surface form of the NE, where any spelling errors that were present in the input text get automatically corrected. We fine-tune two BERT LMs as baselines, as well as eight open-source LLMs, on the task of producing NEs from text that was obtained by applying Optical Character Recognition (OCR) to images of Japanese shop receipts; in this work, we do not attempt to find or evaluate the location of NEs in the text. We show that the best fine-tuned LLM performs as well as, or slightly better than, the best fine-tuned BERT LM, although the differences are not significant. However, the best LLM is also shown to correct OCR errors in some cases, as initially hypothesised.
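A sketch of how one training example for generative NE extraction might be formatted; the field names and JSON target scheme are illustrative assumptions, not the paper's prompt:

```python
# Illustrative formatting of one fine-tuning example for generative NE
# extraction from noisy OCR text (receipt domain).
import json

def make_example(ocr_text, entities):
    prompt = (
        "Extract the named entities from this OCR'd receipt text, "
        "correcting obvious spelling errors:\n" + ocr_text + "\nEntities:"
    )
    target = json.dumps(entities, ensure_ascii=False)
    return {"prompt": prompt, "target": target}

example = make_example(
    "セブンイレブ、 新宿店 合計 ¥1,080",          # OCR dropped the final ン
    {"store": "セブンイレブン", "branch": "新宿店", "total": "¥1,080"},
)
```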
https://arxiv.org/abs/2403.00528
In recent years, text-image joint pre-training techniques have shown promising results in various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances with their corresponding text regions in images poses a challenge, as it requires effective alignment between text and OCR-Text (referring to the text in images as OCR-Text to distinguish from the text in natural language) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers diverse styles of text found in images to a uniform style based on the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks. Additionally, we have designed a new labeling generation method specifically for ODM and combined it with our proposed Text-Controller module to address the challenge of annotation costs in OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at {this https URL}.
https://arxiv.org/abs/2403.00303
The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that includes the time-ordered sequence of strokes both as text and as an image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families, on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.
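A minimal sketch of serializing digital ink into a text token sequence a VLM can consume; the tag format and coordinate quantization are assumptions, not the paper's exact representation:

```python
# Sketch: quantize time-ordered stroke coordinates into a token string.
def ink_to_tokens(strokes, grid=256):
    """strokes: list of strokes, each a list of (x, y) in [0, 1]."""
    parts = []
    for stroke in strokes:
        coords = " ".join(
            f"{int(x * (grid - 1))},{int(y * (grid - 1))}" for x, y in stroke
        )
        parts.append(f"<stroke> {coords} </stroke>")
    return " ".join(parts)

print(ink_to_tokens([[(0.1, 0.2), (0.15, 0.25)], [(0.5, 0.5), (0.6, 0.45)]]))
```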
https://arxiv.org/abs/2402.15307
Despite significant progress in optical character recognition (OCR) and computer vision systems, robustly recognizing text and identifying people in images taken in unconstrained \emph{in-the-wild} environments remain an ongoing challenge. However, such obstacles must be overcome in practical applications of vision systems, such as identifying racers in photos taken during off-road racing events. To this end, we introduce two new challenging real-world datasets - the off-road motorcycle Racer Number Dataset (RND) and the Muddy Racer re-iDentification Dataset (MUDD) - to highlight the shortcomings of current methods and drive advances in OCR and person re-identification (ReID) under extreme conditions. These two datasets feature over 6,300 images taken during off-road competitions which exhibit a variety of factors that undermine even modern vision systems, namely mud, complex poses, and motion blur. We establish benchmark performance on both datasets using state-of-the-art models. Off-the-shelf models transfer poorly, reaching only 15% end-to-end (E2E) F1 score on text spotting, and 33% rank-1 accuracy on ReID. Fine-tuning yields major improvements, bringing model performance to 53% F1 score for E2E text spotting and 79% rank-1 accuracy on ReID, but still falls short of good performance. Our analysis exposes open problems in real-world OCR and ReID that necessitate domain-targeted techniques. With these datasets and analysis of model limitations, we aim to foster innovations in handling real-world conditions like mud and complex poses to drive progress in robust computer vision. All data was sourced from this http URL, a website used by professional motorsports photographers, racers, and fans. The top-performing text spotting and ReID models are deployed on this platform to power real-time race photo search.
https://arxiv.org/abs/2402.08025
Captchas are widely used to secure systems from automatic responses by distinguishing computer responses from human responses. Text, audio, video, and picture-based captchas are in use, with text-based Optical Character Recognition (OCR) captchas being the most common; these face issues such as complex and distorted content. There have been attempts to build captcha detection and classification systems using machine learning and neural networks, which need to be tuned for accuracy. Existing systems face challenges in recognizing distorted characters, handling variable-length captchas, and finding sequential dependencies in captchas. In this work, we propose a segmentation-free OCR model for text captcha classification based on the connectionist temporal classification (CTC) loss technique. The proposed model is trained and tested on a publicly available captcha dataset and achieves 99.80% character-level accuracy and 95% word-level accuracy. The accuracy of the proposed model is compared with state-of-the-art models and proves to be effective. Variable-length, complex captchas can thus be processed with the segmentation-free CTC loss technique, which can be widely used to secure software systems.
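The segmentation-free training objective can be illustrated with PyTorch's built-in CTC loss; the backbone producing the per-frame logits is a stand-in here, not the paper's model:

```python
# Minimal sketch of segmentation-free training with CTC loss.
import torch
import torch.nn as nn

num_classes = 37            # 36 characters + 1 CTC blank (index 0)
seq_len, batch = 20, 4

logits = torch.randn(seq_len, batch, num_classes)        # from a CNN/RNN backbone
log_probs = logits.log_softmax(dim=2)

targets = torch.randint(1, num_classes, (batch, 6))      # padded label sequences
input_lengths = torch.full((batch,), seq_len, dtype=torch.long)
target_lengths = torch.tensor([6, 5, 6, 4])              # variable-length captchas

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```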
https://arxiv.org/abs/2402.05417
In this work, product tables in invoices are obtained autonomously via a deep learning model named ExTTNet. Firstly, text is extracted from invoice images using Optical Character Recognition (OCR) techniques; the Tesseract OCR engine [37] is used for this process. Afterwards, the feature set is expanded using feature extraction methods to increase accuracy. Each text item obtained from OCR is labeled according to whether it is a table element or not. In this study, a multilayer artificial neural network model is used. Training was carried out on an Nvidia RTX 3090 graphics card and took 162 minutes. As a result of the training, the F1 score is 0.92.
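A rough sketch of the described pipeline, with assumed feature choices and network size (not ExTTNet itself): Tesseract OCR, a small per-box feature vector, and an MLP that classifies each text box as table element or not:

```python
# Sketch: OCR with Tesseract, then classify each text box with a tiny MLP.
import pytesseract
from PIL import Image
import torch
import torch.nn as nn

data = pytesseract.image_to_data(
    Image.open("invoice.png"), output_type=pytesseract.Output.DICT
)

features, texts = [], []
for text, left, top, w, h in zip(
    data["text"], data["left"], data["top"], data["width"], data["height"]
):
    if text.strip():
        texts.append(text)
        # Assumed features: box geometry plus a crude "is numeric" flag.
        features.append([left, top, w, h, float(text.replace(".", "").isdigit())])

classifier = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 2))
logits = classifier(torch.tensor(features, dtype=torch.float32))
```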
https://arxiv.org/abs/2402.02246
Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, challenges remain in accurately interpreting detailed visual elements. This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and Optical Character Recognition models to improve fine-grained image understanding and reduce hallucination in responses. Our research investigates the embedding-based infusion of detection information, the impact of such infusion on the MLLMs' original abilities, and the interchangeability of detection models. We conduct systematic experiments with models such as LLaVA-1.5, DINO, and PaddleOCRv2, revealing that our approach not only refines MLLMs' performance in specific visual tasks but also maintains their original strengths. The resulting enhanced MLLMs outperform SOTA models on 9 out of 10 benchmarks, achieving an improvement of up to 12.99% on the normalized average score, marking a notable advancement in multimodal understanding. We release our codes to facilitate further exploration into the fine-grained multimodal dialogue capabilities of MLLMs.
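Schematically, embedding-based infusion can be as simple as projecting detector/OCR features into the LLM's hidden space and prepending them as extra tokens; the dimensions below are illustrative assumptions:

```python
# Sketch: infuse detection/OCR features as extra input tokens for an MLLM.
import torch
import torch.nn as nn

hidden = 4096                                  # assumed LLM hidden size
project = nn.Linear(256, hidden)               # detector feature -> LLM space

det_feats = torch.randn(1, 10, 256)            # 10 detected objects / OCR boxes
det_tokens = project(det_feats)                # (1, 10, hidden)

vision_tokens = torch.randn(1, 576, hidden)    # e.g. vision-encoder patch tokens
text_embeds = torch.randn(1, 32, hidden)       # embedded prompt tokens

inputs_embeds = torch.cat([det_tokens, vision_tokens, text_embeds], dim=1)
# `inputs_embeds` would then be passed to the language model's forward pass.
```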
https://arxiv.org/abs/2401.17981
Recent advancements in deep neural networks have markedly enhanced the performance of computer vision tasks, yet the specialized nature of these networks often necessitates extensive data and high computational power. Addressing these requirements, this study presents a novel neural network model adept at optical character recognition (OCR) across diverse domains, leveraging the strengths of multi-task learning to improve efficiency and generalization. The model is designed to achieve rapid adaptation to new domains, maintain a compact size conducive to reduced computational resource demand, ensure high accuracy, retain knowledge from previous learning experiences, and allow for domain-specific performance improvements without the need to retrain entirely. Rigorous evaluation on open datasets has validated the model's ability to significantly lower the number of trainable parameters without sacrificing performance, indicating its potential as a scalable and adaptable solution in the field of computer vision, particularly for applications in optical text recognition.
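A toy rendition of the multi-task setup: one shared recognition backbone with lightweight per-domain heads, so adapting to a new domain only adds a head; all sizes are placeholders:

```python
# Sketch: shared backbone + per-domain output heads for multi-domain OCR.
import torch
import torch.nn as nn

class MultiDomainOCR(nn.Module):
    def __init__(self, vocab_size, domains, dim=256):
        super().__init__()
        self.backbone = nn.GRU(input_size=64, hidden_size=dim, batch_first=True)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(dim, vocab_size) for name in domains}
        )

    def forward(self, feats, domain):
        out, _ = self.backbone(feats)          # feats: (batch, time, 64)
        return self.heads[domain](out)         # per-frame character logits

model = MultiDomainOCR(vocab_size=100, domains=["receipts", "plates", "books"])
logits = model(torch.randn(2, 30, 64), domain="plates")
```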
https://arxiv.org/abs/2401.00971
The laws relating model size, data volume, computation and model performance have been extensively studied in the field of Natural Language Processing (NLP). However, the scaling laws in Optical Character Recognition (OCR) have not yet been investigated. To address this, we conducted comprehensive studies examining the correlation between performance and the scale of models, data volume and computation in the field of text recognition. Conclusively, the study demonstrates smooth power laws between performance and model size, as well as training data volume, when other influencing factors are held constant. Additionally, we have constructed a large-scale dataset called REBU-Syn, which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset, we have successfully trained a scene text recognition model, achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%.
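The power-law fitting itself reduces to a log-log linear fit; a minimal example with made-up numbers (only the procedure is illustrated):

```python
# Fit err = a * N^b from (size, error) pairs via a log-log linear fit.
import numpy as np

model_sizes = np.array([1e6, 5e6, 2e7, 1e8, 5e8])      # parameters (fake data)
error_rates = np.array([0.20, 0.14, 0.10, 0.07, 0.05])

b, log_a = np.polyfit(np.log(model_sizes), np.log(error_rates), deg=1)
a = np.exp(log_a)
print(f"err ~ {a:.3f} * N^{b:.3f}")
predicted = a * (1e9 ** b)                              # extrapolate to 1B params
```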
https://arxiv.org/abs/2401.00028
The widespread usage of cars and other large, heavy vehicles necessitates the development of an effective parking infrastructure. Additionally, algorithms for the detection and recognition of number plates are often used to identify automobiles around the world where standardized plate sizes and fonts are enforced, making recognition an effortless task. As a result, both kinds of data can be combined to develop an intelligent parking system that focuses on the technology of Automatic Number Plate Recognition (ANPR). Retrieving characters from an input number plate image is the sole purpose of ANPR, which is a costly procedure. In this article, we propose Chaurah, a minimal-cost ANPR system that relies on a Raspberry Pi 3 and was specifically created for parking facilities. The system employs a dual-stage methodology, with the first stage being an ANPR system that makes use of two convolutional neural networks (CNNs). The primary CNN locates and recognises license plates in a vehicle image, while the secondary performs Optical Character Recognition (OCR) to identify the individual characters on the number plate. An application built with Flutter and Firebase for database administration and license plate record comparison makes up the second component of the overall solution. The application also acts as a user interface for the billing mechanism based on parking time duration, resulting in an all-encompassing software deployment of the study.
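A skeleton of the dual-stage flow (plate detector, crop, character recognizer); the function names are placeholders, not Chaurah's actual API:

```python
# Sketch of a two-stage ANPR pipeline: detect the plate, crop it, read it.
from dataclasses import dataclass

@dataclass
class PlateReading:
    plate_text: str
    confidence: float

def read_plate(image, plate_detector, char_recognizer) -> PlateReading:
    box, det_conf = plate_detector(image)                 # stage 1: locate plate
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]                            # numpy-style crop
    text, ocr_conf = char_recognizer(crop)                # stage 2: OCR characters
    return PlateReading(plate_text=text, confidence=min(det_conf, ocr_conf))
```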
https://arxiv.org/abs/2312.16894
Text segmentation, the task of dividing a document into sections, is often a prerequisite for performing additional natural language processing tasks. Existing text segmentation methods have typically been developed and tested using clean, narrative-style text with segments containing distinct topics. Here we consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each. In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other. In addition, the text of the announcements, which is derived from images of historical newspapers via optical character recognition, contains many typographical errors. As a result, these announcements are not amenable to segmentation with existing techniques. We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.
https://arxiv.org/abs/2312.12773
Optical character recognition (OCR) is a vital process that involves the extraction of handwritten or printed text from scanned or printed images, converting it into a format that can be understood and processed by machines. This enables further data processing activities such as searching and editing. The automatic extraction of text through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on prevailing techniques utilized throughout the OCR process, with a dedicated effort to discern the most efficacious approaches that demonstrate enhanced outcomes. To ensure a thorough evaluation, a meticulous keyword-search methodology is adopted, encompassing a comprehensive analysis of articles relevant to Arabic OCR, including both backward and forward citation reviews. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR. By highlighting these gaps, we shed light on potential areas for future exploration and development, thereby guiding researchers toward promising avenues in the field of Arabic OCR. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders involved in Arabic OCR, ultimately fostering advancements in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language.
https://arxiv.org/abs/2312.11812
In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the generated and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available.
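A toy version of learnable task prompts: one parameter vector per task, fused with the encoder features so a single decoder can serve all tasks; the additive fusion below is an assumption, not UPOCR's exact design:

```python
# Sketch: learnable per-task prompt vectors added to encoder features.
import torch
import torch.nn as nn

class TaskPromptedFeatures(nn.Module):
    def __init__(self, tasks, dim=768):
        super().__init__()
        self.prompts = nn.ParameterDict(
            {t: nn.Parameter(torch.zeros(1, 1, dim)) for t in tasks}
        )

    def forward(self, encoder_tokens, task):
        # encoder_tokens: (batch, num_patches, dim); broadcast-add the prompt.
        return encoder_tokens + self.prompts[task]

module = TaskPromptedFeatures(["removal", "segmentation", "tamper"])
feats = module(torch.randn(2, 196, 768), task="removal")
```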
https://arxiv.org/abs/2312.02694
Optical Character Recognition is a technique that converts document images into searchable and editable text, making it a valuable tool for processing scanned documents. While Farsi stands as a prominent and official language in Asia, efforts to develop efficient methods for recognizing Farsi printed text have been relatively limited. This is primarily attributed to the language's distinctive features, such as its cursive form, the resemblance between certain alphabet characters, and the presence of numerous diacritics and dot placements. On the other hand, given the substantial training sample requirements of deep architectures for effective performance, the development of such datasets holds paramount significance. In light of these concerns, this paper presents a novel large-scale dataset, IDPL-PFOD2, tailored for Farsi printed text recognition. The dataset comprises 2,003,541 images featuring a wide variety of fonts, styles, and sizes. This dataset is an extension of the previously introduced IDPL-PFOD dataset, offering a substantial increase in both volume and diversity. Furthermore, the dataset's effectiveness is assessed through the utilization of both CRNN-based and Vision Transformer architectures. The CRNN-based model achieves a baseline accuracy of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer architecture attains an accuracy of 81.32% and a normalized edit distance of 98.74%.
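For reference, a plain-Python normalized edit-distance score of the kind used to report figures like 97.72% and 98.74%; whether this matches the paper's exact normalization is an assumption:

```python
# Normalized edit similarity: 1 - Levenshtein(pred, gt) / max(len(pred), len(gt)).
def normalized_edit_similarity(pred: str, gt: str) -> float:
    m, n = len(pred), len(gt)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return 1.0 - dp[m][n] / max(m, n, 1)

print(normalized_edit_similarity("سلام", "سلم"))   # 0.75
```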
https://arxiv.org/abs/2312.01177
Recent advancements in Optical Character Recognition (OCR) have been driven by transformer-based models. OCR systems are critical in numerous high-stakes domains, yet their vulnerability to adversarial attack remains largely uncharted territory, raising concerns about security and compliance with emerging AI regulations. In this work we present a novel framework to assess the resilience of Transformer-based OCR (TrOCR) models. We develop and assess algorithms for both targeted and untargeted attacks. For the untargeted case, we measure the Character Error Rate (CER), while for the targeted case we use the success ratio. We find that TrOCR is highly vulnerable to untargeted attacks and somewhat less vulnerable to targeted attacks. On a benchmark handwriting data set, untargeted attacks can cause a CER of more than 1 without being noticeable to the eye. With a similar perturbation size, targeted attacks can lead to success rates of around $25\%$ -- here we attacked single tokens, requiring TrOCR to output the tenth most likely token from a large vocabulary.
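A sketch of an untargeted, single-step (FGSM-style) attack on TrOCR pixel inputs via the Hugging Face interface; the step size and single-step simplification are assumptions relative to the paper's algorithms:

```python
# Sketch: untargeted FGSM-style perturbation of TrOCR input pixels.
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line.png").convert("RGB")           # a handwriting line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
labels = processor.tokenizer("ground truth text", return_tensors="pt").input_ids

pixel_values.requires_grad_(True)
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()

epsilon = 2 / 255                                        # heuristic step size
adv_pixels = (pixel_values + epsilon * pixel_values.grad.sign()).detach()
adv_ids = model.generate(adv_pixels)
print(processor.batch_decode(adv_ids, skip_special_tokens=True))
```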
https://arxiv.org/abs/2311.17128
This paper introduces a novel approach to post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap is due to the lack of large text corpora that provide OCR errors for further training of language-based POC models, which are demanding in terms of corpus size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves. Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on the HWR200 and School_notebooks_RU datasets as they provide significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers when evaluating student performance; this can be done simply by comparing sentences before and after correction and displaying the differences in the text. Our primary contribution lies in the innovative use of Bézier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, using real open corpora of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. Paper contributions can be found in this https URL
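The stroke-synthesis primitive is easy to illustrate: a cubic Bézier sampler of the kind such an engine can build strokes from (control-point selection and how strokes are assembled into letters are not reproduced here):

```python
# Sketch: sample points along a cubic Bézier curve as one synthetic stroke.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

stroke = cubic_bezier(np.array([0.0, 0.0]), np.array([0.2, 0.8]),
                      np.array([0.6, -0.3]), np.array([1.0, 0.1]))
```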
https://arxiv.org/abs/2311.15896