Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, images generated by stable diffusion models often suffer from deteriorated details. This pitfall impacts image editing tasks that require information preservation, e.g., scene text editing. Ideally, the model should be able to replace the text in the source image with the target text while preserving details such as color, font size, and background. To leverage the potential of diffusion models, in this work we introduce a Diffusion-BasEd Scene Text manipulation network, DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-art methods on various scene text datasets, and provide extensive ablation studies at each level of granularity to analyze our performance gains. We also demonstrate the effectiveness of our method in synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy. Our method achieves 94.15% and 98.12% character-level accuracy on the COCO-text and ICDAR2013 datasets, respectively.
https://arxiv.org/abs/2311.00734
This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich documents. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin content, but struggles with multilingual scenarios and complex tasks. Based on these observations, we delve deeper into the necessity of specialized OCR models and deliberate on strategies to fully harness pretrained general LMMs like GPT-4V for downstream OCR tasks. The study offers a critical reference for future research in OCR with LMMs. Evaluation pipeline and results are available at this https URL.
https://arxiv.org/abs/2310.16809
Key information extraction (KIE) from scanned documents has gained increasing attention because of its applications in various domains. Although promising results have been achieved by some recent KIE approaches, they are usually built based on discriminative models, which lack the ability to handle optical character recognition (OCR) errors and require laborious token-level labelling. In this paper, we propose a novel generative end-to-end model, named GenKIE, to address the KIE task. GenKIE is a sequence-to-sequence multimodal generative model that utilizes multimodal encoders to embed visual, layout and textual features and a decoder to generate the desired output. Well-designed prompts are leveraged to incorporate the label semantics as the weakly supervised signals and entice the generation of the key information. One notable advantage of the generative model is that it enables automatic correction of OCR errors. Besides, token-level granular annotation is not required. Extensive experiments on multiple public real-world datasets show that GenKIE effectively generalizes over different types of documents and achieves state-of-the-art results. Our experiments also validate the model's robustness against OCR errors, making GenKIE highly applicable in real-world scenarios.
https://arxiv.org/abs/2310.16131
For a financial analyst, the question and answer (Q&A) segment of the company financial report is a crucial piece of information for various analysis and investment decisions. However, extracting valuable insights from the Q&A section poses considerable challenges: conventional methods such as detailed reading and note-taking lack scalability and are susceptible to human error, while Optical Character Recognition (OCR) and similar techniques struggle to accurately process unstructured transcript text, often missing the subtle linguistic nuances that drive investor decisions. Here, we demonstrate the use of Large Language Models (LLMs) to efficiently and rapidly extract information from earnings report transcripts while ensuring high accuracy, transforming the extraction process and reducing hallucination by combining a retrieval-augmented generation technique with metadata. We evaluate the outcomes of various LLMs with and without our proposed approach on a range of objective metrics for evaluating Q&A systems, and empirically demonstrate the superiority of our method.
https://arxiv.org/abs/2310.10760
Billions of public domain documents remain trapped in hard copy or lack accurate digitization. Modern natural language processing methods cannot be used to index, retrieve, and summarize their texts; conduct computational textual analyses; or extract information for statistical analyses, and these texts cannot be incorporated into language model training. Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets. Existing OCR engines, largely designed for small-scale commercial applications in high-resource languages, often fall short of these requirements. EffOCR (EfficientOCR), a novel open-source OCR package, meets both the computational and sample efficiency requirements for liberating texts at scale by abandoning the sequence-to-sequence architecture typically used for OCR, which takes representations from a learned vision model as inputs to a learned language model. Instead, EffOCR models OCR as a character- or word-level image retrieval problem. EffOCR is cheap and sample-efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language. Models in the EffOCR model zoo can be deployed off-the-shelf with only a few lines of code. Importantly, EffOCR also allows for easy customization through a simple model training interface, with minimal labeling requirements owing to its sample efficiency. We illustrate the utility of EffOCR by cheaply and accurately digitizing 20 million historical U.S. newspaper scans, evaluating zero-shot performance on randomly selected documents from the U.S. National Archives, and accurately digitizing Japanese documents for which all other OCR solutions failed.
https://arxiv.org/abs/2310.10050
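To make the retrieval framing above concrete, here is a minimal sketch of character recognition as image retrieval: embed a query crop and return the label of the nearest reference glyph. The raw-pixel embedding, the toy glyphs, and the function names are placeholders chosen for illustration only; the actual EffOCR package uses learned vision encoders and its own model-zoo API, which is not reproduced here.

    # Toy retrieval-style character recognition (illustrative only).
    import numpy as np

    def embed(crop: np.ndarray) -> np.ndarray:
        """Placeholder embedding: flattened, L2-normalised pixels."""
        v = crop.astype(np.float32).ravel()
        return v / (np.linalg.norm(v) + 1e-9)

    def recognize(crop, reference_crops, reference_labels):
        """Return the label of the reference glyph closest to the query crop."""
        q = embed(crop)
        sims = [float(q @ embed(r)) for r in reference_crops]
        return reference_labels[int(np.argmax(sims))]

    # Tiny example with 8x8 "glyphs": a vertical bar ("l") and a filled block ("m").
    bar = np.zeros((8, 8)); bar[:, 3:5] = 1.0
    block = np.ones((8, 8))
    print(recognize(bar + 0.05, [bar, block], ["l", "m"]))  # -> "l"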
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. To be specific, a large number of detected objects and optical character recognition (OCR) tokens result in rich visual relationships. Existing works take all visual relationships into account for answer prediction. However, there are three observations: (1) a single subject in the images can be easily detected as multiple objects with distinct bounding boxes (considered repetitive objects). The associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in the image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may be indicative of important visual cues for predicting answers. Rather than utilizing all of them for answer prediction, we make an effort to identify the most important connections or eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU for spatially aware pruning. We consider three visual relationships for graph learning: object-object, OCR token-OCR token, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations in the correlated object-token sparse graph, and then in the respective object-based sparse graph and token-based sparse graph. Experimental results on the TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance, and visualization results further demonstrate the interpretability of our method.
https://arxiv.org/abs/2310.09147
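As a rough illustration of one of the spatial factors listed above, the following sketch computes DIoU for a pair of axis-aligned boxes given as (x1, y1, x2, y2). The box format and the helper name are assumptions made for illustration; SSGN's actual graph construction and pruning thresholds are not shown.

    # DIoU = IoU - (squared center distance / squared enclosing-box diagonal).
    def diou(box_a, box_b):
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b

        # Intersection over union.
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (ax2 - ax1) * (ay2 - ay1)
        area_b = (bx2 - bx1) * (by2 - by1)
        iou = inter / (area_a + area_b - inter + 1e-9)

        # Squared distance between box centers.
        cax, cay = (ax1 + ax2) / 2, (ay1 + ay2) / 2
        cbx, cby = (bx1 + bx2) / 2, (by1 + by2) / 2
        center_dist = (cax - cbx) ** 2 + (cay - cby) ** 2

        # Squared diagonal of the smallest enclosing box.
        ex1, ey1 = min(ax1, bx1), min(ay1, by1)
        ex2, ey2 = max(ax2, bx2), max(ay2, by2)
        diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2

        return iou - center_dist / (diag + 1e-9)

    # Two nearby OCR-token boxes score higher than two distant ones.
    print(diou((0, 0, 10, 10), (5, 5, 15, 15)))
    print(diou((0, 0, 10, 10), (80, 80, 90, 90)))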
Optical Character Recognition (OCR) is a widely used tool to extract text from scanned documents. Today, the state of the art is achieved by exploiting deep neural networks. However, this performance comes at the price of system vulnerability. For instance, in backdoor attacks, attackers compromise the training phase by inserting a backdoor in the victim's model that will be activated at testing time by specific patterns while leaving the overall model performance intact. This work proposes a backdoor attack on OCR that results in the injection of non-readable characters from malicious input images. This simple but effective attack exposes a weakness of state-of-the-art OCR, making the extracted text correct to human eyes but simultaneously unusable for NLP applications that use OCR as a preprocessing step. Experimental results show that the attacked models successfully output non-readable characters for around 90% of the poisoned instances without harming their performance on the remaining instances.
https://arxiv.org/abs/2310.08259
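A generic sketch of the data-poisoning step that such an attack relies on: stamp a small trigger patch onto a fraction of the training images and append non-readable characters to their transcriptions. The 4x4 white-corner trigger, the 10% poison rate, and the zero-width-space target characters are illustrative assumptions, not the paper's settings.

    # Illustrative poisoning of an OCR training set (images are HxW float arrays).
    import numpy as np

    def poison(images, labels, rate=0.1, target_suffix="\u200b\u200b", rng=None):
        """Return poisoned copies of (images, labels)."""
        rng = rng or np.random.default_rng(0)
        images = [img.copy() for img in images]
        labels = list(labels)
        picked = rng.choice(len(images), size=int(rate * len(images)), replace=False)
        for i in picked:
            images[i][-4:, -4:] = 1.0              # white corner patch as trigger
            labels[i] = labels[i] + target_suffix  # inject non-readable characters
        return images, labels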
What happens if we encounter a suitable font for our design work but do not know its name? Visual Font Recognition (VFR) systems are used to identify the font typeface in an image. These systems can assist graphic designers in identifying fonts used in images. A VFR system also aids in improving the speed and accuracy of Optical Character Recognition (OCR) systems. In this paper, we introduce the first publicly available datasets in the field of Persian font recognition and employ Convolutional Neural Networks (CNN) to address this problem. The results show that the proposed pipeline obtained 78.0% top-1 accuracy on our new datasets, 89.1% on the IDPL-PFOD dataset, and 94.5% on the KAFD dataset. Furthermore, the average time spent in the entire pipeline for one sample of our proposed datasets is 0.54 and 0.017 seconds for CPU and GPU, respectively. We conclude that CNN methods can be used to recognize Persian fonts without the need for additional pre-processing steps such as feature extraction, binarization, normalization, etc.
https://arxiv.org/abs/2310.05255
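For readers who want a concrete picture of the CNN pipeline described above, here is a small PyTorch classifier sketch over grayscale text crops. The 64x64 input size, layer widths, and 20-font output head are placeholder assumptions, not the paper's architecture.

    # Minimal CNN font classifier sketch (illustrative architecture only).
    import torch
    import torch.nn as nn

    class FontCNN(nn.Module):
        def __init__(self, num_fonts: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
                nn.Linear(256, num_fonts),
            )

        def forward(self, x):  # x: (batch, 1, 64, 64) grayscale text crops
            return self.classifier(self.features(x))

    logits = FontCNN(num_fonts=20)(torch.randn(4, 1, 64, 64))  # -> shape (4, 20)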
A main goal of Argument Mining (AM) is to analyze an author's stance. Unlike previous AM datasets focusing only on text, the shared task at the 10th Workshop on Argument Mining introduces a dataset including both text and images. Importantly, these images contain both visual elements and optical characters. Our new framework, TILFA (A Unified Framework for Text, Image, and Layout Fusion in Argument Mining), is designed to handle this mixed data. It excels not only at understanding text but also at detecting optical characters and recognizing layout details in images. Our model significantly outperforms existing baselines, earning our team, KnowComp, first place on the leaderboard of the Argumentative Stance Classification subtask in this shared task.
https://arxiv.org/abs/2310.05210
Digital archiving is becoming widespread owing to its effectiveness in protecting valuable books and providing knowledge to many people electronically. In this paper, we propose a novel approach to leverage digital archives for machine learning. If we can fully utilize such digitized data, machine learning has the potential to uncover unknown insights and ultimately acquire knowledge autonomously, just like humans read books. As a first step, we design a dataset construction pipeline comprising an optical character reader (OCR), an object detector, and a layout analyzer for the autonomous extraction of image-text pairs. In our experiments, we apply our pipeline on old photo books to construct an image-text pair dataset, showing its effectiveness in image-text retrieval and insight extraction.
https://arxiv.org/abs/2310.01936
In the domain of Natural Language Processing (NLP), Named Entity Recognition (NER) stands out as a pivotal mechanism for extracting structured insights from unstructured text. This manuscript offers an exhaustive exploration into the evolving landscape of NER methodologies, blending foundational principles with contemporary AI advancements. Beginning with the rudimentary concepts of NER, the study spans a spectrum of techniques from traditional rule-based strategies to the contemporary marvels of transformer architectures, particularly highlighting integrations such as BERT with LSTM and CNN. The narrative accentuates domain-specific NER models, tailored for intricate areas like finance, legal, and healthcare, emphasizing their specialized adaptability. Additionally, the research delves into cutting-edge paradigms including reinforcement learning, innovative constructs like E-NER, and the interplay of Optical Character Recognition (OCR) in augmenting NER capabilities. Grounding its insights in practical realms, the paper sheds light on the indispensable role of NER in sectors like finance and biomedicine, addressing the unique challenges they present. The conclusion outlines open challenges and avenues, marking this work as a comprehensive guide for those delving into NER research and applications.
https://arxiv.org/abs/2309.14084
Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image, and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
https://arxiv.org/abs/2308.15996
In this paper, we introduce Handwritten augmentation, a new data augmentation method for handwritten character images. The method augments handwritten image data by altering the shape of input characters during training. The proposed handwritten augmentation is similar to position and color augmentation for images but focuses more deeply on handwritten characters. Handwritten augmentation is data-driven, easy to implement, and can be integrated with CNN-based optical character recognition models. It can be applied alongside commonly used data augmentation techniques such as cropping and rotation, and yields better model performance on handwritten image datasets developed using optical character recognition methods.
https://arxiv.org/abs/2308.13791
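The abstract does not spell out the exact transform, so the sketch below shows only a generic shape-altering augmentation for grayscale character images: a classic elastic distortion built from a random smoothed displacement field. The alpha/sigma values and the synthetic test image are illustrative assumptions.

    # Generic shape-altering augmentation: elastic distortion of a 2-D image.
    import numpy as np
    from scipy.ndimage import gaussian_filter, map_coordinates

    def elastic_distort(image, alpha=8.0, sigma=3.0, rng=None):
        """Apply a random elastic distortion to a 2-D grayscale image."""
        rng = rng or np.random.default_rng()
        h, w = image.shape
        # Random displacement fields, smoothed so neighbouring pixels move together.
        dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
        dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        coords = np.stack([ys + dy, xs + dx])
        return map_coordinates(image, coords, order=1, mode="reflect")

    # Example: distort a synthetic 28x28 "stroke" image.
    img = np.zeros((28, 28)); img[10:18, 6:22] = 1.0
    augmented = elastic_distort(img)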
Document digitization is vital for preserving historical records, efficient document management, and advancing OCR (Optical Character Recognition) research. Document Layout Analysis (DLA) involves segmenting documents into meaningful units like text boxes, paragraphs, images, and tables. Challenges arise when dealing with diverse layouts, historical documents, and unique scripts like Bengali, and progress is hindered by the lack of comprehensive Bengali DLA datasets. We improved the accuracy of the DLA model for Bengali documents by utilizing advanced Mask R-CNN models available in the Detectron2 library. Our evaluation involved three variants: Mask R-CNN R-50, R-101, and X-101, both with and without pretrained weights from PubLayNet, on the BaDLAD dataset, which contains human-annotated Bengali documents in four categories: text boxes, paragraphs, images, and tables. Results show the effectiveness of these models in accurately segmenting Bengali documents. We discuss speed-accuracy tradeoffs and underscore the significance of pretrained weights. Our findings expand the applicability of Mask R-CNN in document layout analysis, efficient document management, and OCR research, while suggesting future avenues for fine-tuning and data augmentation.
https://arxiv.org/abs/2308.13769
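Below is a minimal Detectron2 fine-tuning sketch in the spirit of the setup above: Mask R-CNN R-50 FPN adapted to four layout classes (text box, paragraph, image, table). The dataset name, paths, and COCO-format annotation file are placeholders, and COCO weights from the Detectron2 model zoo are used here rather than the paper's PubLayNet-pretrained weights.

    # Fine-tuning Mask R-CNN R-50 FPN on a COCO-format layout dataset (sketch).
    import os
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.data.datasets import register_coco_instances
    from detectron2.engine import DefaultTrainer

    # Placeholder dataset registration; point these paths at real data.
    register_coco_instances("badlad_train", {}, "badlad/train.json", "badlad/images")

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    cfg.DATASETS.TRAIN = ("badlad_train",)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4   # text box, paragraph, image, table
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.SOLVER.BASE_LR = 0.00025
    cfg.SOLVER.MAX_ITER = 3000

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()

Swapping the config and checkpoint strings for the R-101 or X-101 variants reproduces the other two backbones compared in the paper.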
Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.
https://arxiv.org/abs/2308.13418
This paper discusses the challenges of optical character recognition (OCR) on natural scenes, which is harder than OCR on documents due to the wild content and various image backgrounds. We propose to uniformly use word error rates (WER) as a new measurement for evaluating scene-text OCR, both end-to-end (e2e) performance and individual system component performances. Particularly for the e2e metric, we name it DISGO WER as it considers Deletion, Insertion, Substitution, and Grouping/Ordering errors. Finally we propose to utilize the concept of super blocks to automatically compute BLEU scores for e2e OCR machine translation. The small SCUT public test set is used to demonstrate WER performance by a modularized OCR system.
https://arxiv.org/abs/2308.13173
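For reference, the sketch below computes the conventional word error rate (deletions, insertions, and substitutions via Levenshtein alignment) that the proposed metric extends; the grouping/ordering ("GO") errors that complete DISGO WER are not modeled here.

    # Standard word error rate via edit distance (D/I/S only).
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j].
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i          # deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j          # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + sub)  # substitution / match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("exit 25 mph speed limit", "exit 25 mph sped limit"))  # 0.2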
Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers the progress of document digitization in various low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing system, suffer from the lack of large-scale datasets for various document OCR components such as word-level OCR, document layout extraction, and distortion correction, which are available as individual modules in high-resource languages. In this paper, we introduce this http URL-BRACU-OCR (bbOCR): an open-source, scalable document OCR system that can reconstruct Bengali documents into a structured, searchable digitized format, leveraging a novel Bengali text recognition model and two novel synthetic datasets. We present extensive component-level and system-level evaluations, both using a novel diversified evaluation dataset and comprehensive evaluation metrics. Our extensive evaluation suggests that our proposed solution is preferable to the current state-of-the-art Bengali OCR systems. The source codes and datasets are available here: this https URL.
https://arxiv.org/abs/2308.10647
Language models are useful adjuncts to optical models for producing accurate optical character recognition (OCR) results. One factor which limits the power of language models in this context is the existence of many specialized domains with language statistics very different from those implied by a general language model - think of checks, medical prescriptions, and many other specialized document classes. This paper introduces an algorithm for efficiently generating and attaching a domain-specific, word-based language model at run time to a general language model in an OCR system. In order to best use this model, the paper also introduces a modified CTC beam search decoder which effectively allows hypotheses to remain in contention based on possible future completion of vocabulary words. The result is a substantial reduction in word error rate when recognizing material from specialized domains.
https://arxiv.org/abs/2308.09671
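To make the decoding setting concrete, here is a minimal CTC greedy decode (collapse repeats, drop blanks) plus a toy domain-lexicon lookup. The alphabet and lexicon are invented for illustration; the paper's actual contribution, a modified CTC beam search that keeps hypotheses in contention when they are prefixes of domain vocabulary words, is not reproduced here.

    # Greedy CTC decoding with a toy domain lexicon (illustrative only).
    import numpy as np

    ALPHABET = ["-", "a", "c", "s", "t"]          # index 0 is the CTC blank
    DOMAIN_LEXICON = {"cat", "cast"}              # hypothetical specialized vocabulary

    def ctc_greedy_decode(logits: np.ndarray) -> str:
        """logits: (time_steps, len(ALPHABET)) array of per-frame scores."""
        best = logits.argmax(axis=1)
        out, prev = [], 0
        for idx in best:
            if idx != prev and idx != 0:          # collapse repeats, skip blanks
                out.append(ALPHABET[idx])
            prev = idx
        return "".join(out)

    # Toy per-frame posteriors spelling "c a t" with blanks in between.
    frames = np.eye(len(ALPHABET))[[2, 0, 1, 0, 4]]
    word = ctc_greedy_decode(frames)
    print(word, word in DOMAIN_LEXICON)           # cat True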
In a challenge-response study, we subjected Google Bard to 64 visual challenges designed to probe multimodal Large Language Models (LLMs). The challenges spanned diverse categories, including "Visual Situational Reasoning," "Visual Text Reasoning," and "Next Scene Prediction," among others, to discern Bard's competence in melding visual and linguistic analyses. Our findings indicate that Bard tends to rely on making educated guesses about visuals, especially when determining cues from images. Unlike other models such as GPT4, Bard does not appear to rely on optical character recognition libraries like Tesseract, but recognizes text in complex images in the manner of deep learning models such as Google Lens and Visual API. Significantly, Bard can visually solve CAPTCHAs that ChatGPT fails to understand, recommending Tesseract solutions. Moreover, while the Bard model proposes solutions based on visual input, it cannot recreate or modify the original visual objects to support its conclusions. Bard fails to redraw ASCII art that the text can describe, or to capture a simple Tic Tac Toe grid it claims to analyze for the next moves. This study provides experimental insights into the current capacities and areas for improvement in multimodal LLMs.
https://arxiv.org/abs/2309.16705
This paper presents OmniDataComposer, an innovative approach for multimodal data fusion and unlimited data generation that aims to refine and simplify the interplay among diverse data modalities. As its core breakthrough, it introduces a cohesive data structure proficient in processing and merging multimodal data inputs, which include video, audio, and text. Our crafted algorithm leverages advancements across multiple operations such as video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), Recognize Anything Model (RAM), and object tracking. OmniDataComposer is capable of identifying over 6400 categories of objects, substantially broadening the spectrum of visual information. It amalgamates these diverse modalities, promoting reciprocal enhancement among modalities and facilitating cross-modal data correction. The final output metamorphoses each video input into an elaborate sequential document, virtually transmuting videos into thorough narratives and making them easier for large language models to process. Future prospects include optimizing datasets for each modality to encourage unlimited data generation. This robust base will offer priceless insights to models like ChatGPT, enabling them to create higher quality datasets for video captioning and easing question-answering tasks based on video content. OmniDataComposer inaugurates a new stage in multimodal learning, imparting enormous potential for augmenting AI's understanding and generation of complex, real-world data.
https://arxiv.org/abs/2308.04126