Currently, a substantial volume of document data exists in an unstructured format, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, pdfplumber, and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can only extract tables from digital PDFs and cannot handle image-based PDFs or pictures. PP-StructureV2, on the other hand, can extract tables from both image-based PDFs and pictures, yet it lacks the ability to differentiate between application scenarios such as wired versus wireless tables and digital versus image-based PDFs. To address these issues, we introduce the PDF table extraction (PdfTable) toolkit. It integrates numerous open-source models, including seven table recognition models, four optical character recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable achieves adaptability across various application scenarios. We substantiate the efficacy of the PdfTable toolkit through verification on a self-labeled wired-table dataset and the open-source wireless Public Table Recognition dataset (PubTabNet). The PdfTable code will be available on GitHub: this https URL.
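As a concrete illustration of the digital-PDF limitation mentioned above, the sketch below extracts tables with pdfplumber; the file path is a placeholder, and scanned or image-based PDFs would return nothing here, which is exactly the gap PdfTable targets.

```python
# Minimal sketch of digital-PDF table extraction with pdfplumber ("report.pdf"
# is a placeholder). It relies on the PDF's embedded text layer, so purely
# image-based PDFs or photos yield no tables -- the limitation noted above.
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            print(f"Page {page_number}: table with {len(table)} rows")
            for row in table:
                print(row)
```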
https://arxiv.org/abs/2409.05125
In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
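A minimal sketch of the perplexity-based filtering step described above: each text sample is scored with a small causal LM and only the lowest-perplexity fraction is kept. GPT-2 is a stand-in scorer and the keep ratio is an assumption; the paper's actual scoring model, data, and thresholds are not reproduced here.

```python
# Sketch of perplexity-based data filtering: score each text sample with a
# small causal LM and keep the lowest-perplexity fraction for pre-training.
# GPT-2 is a stand-in scorer; the paper's scoring model and cutoffs may differ.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # mean token NLL -> perplexity

samples = ["A cat sits on a red sofa.", "asdf qwerty 123 zzz"]
scored = sorted(samples, key=perplexity)
keep = scored[: max(1, int(0.5 * len(scored)))]  # keep the lowest-perplexity half
print(keep)
```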
https://arxiv.org/abs/2409.04828
Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.
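The core idea of ConfBERT, injecting per-token OCR confidence into the token embeddings, can be sketched as follows; the additive projection and the dimensions are illustrative assumptions rather than the model's exact architecture.

```python
# Schematic sketch of folding per-token OCR confidence scores into token
# embeddings before a BERT-style encoder. The additive projection and sizes
# are illustrative assumptions, not ConfBERT's exact implementation.
import torch
import torch.nn as nn

class ConfidenceAugmentedEmbedding(nn.Module):
    def __init__(self, vocab_size: int = 30522, hidden: int = 768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.conf_proj = nn.Linear(1, hidden)  # maps one confidence scalar per token

    def forward(self, token_ids: torch.Tensor, confidences: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); confidences: (batch, seq_len) in [0, 1]
        return self.token_emb(token_ids) + self.conf_proj(confidences.unsqueeze(-1))

emb = ConfidenceAugmentedEmbedding()
ids = torch.randint(0, 30522, (2, 16))
conf = torch.rand(2, 16)
print(emb(ids, conf).shape)  # torch.Size([2, 16, 768])
```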
https://arxiv.org/abs/2409.04117
Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's needs due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all the above "characters" across various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in both slice and whole-page formats. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via a simple prompt. In addition, the model offers interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we adapt dynamic-resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.
https://arxiv.org/abs/2409.01704
The digitization of historical documents is crucial for preserving the cultural heritage of society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem, as standard OCR tools are not tailored to deal with historical orthography or with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and copy and coverage mechanisms, to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25\%, which is an increase of 16\% compared to the state of the art on the ICDAR 2019 Bulgarian dataset. We release our data and code at \url{this https URL}.
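The diagonal attention loss mentioned above can be read as a penalty on cross-attention mass that strays from the nearly monotonic OCR-to-correction alignment. Below is a minimal PyTorch sketch of one plausible formulation; the exact weighting used in the paper may differ.

```python
# Sketch of a diagonal attention loss for post-OCR correction: since the noisy
# input and the corrected output are almost monotonically aligned, attention
# weights far from the diagonal are penalized. Illustrative formulation only.
import torch

def diagonal_attention_loss(attn: torch.Tensor) -> torch.Tensor:
    # attn: (batch, tgt_len, src_len), each row sums to 1 (softmax over source)
    batch, tgt_len, src_len = attn.shape
    tgt_pos = torch.linspace(0, 1, tgt_len).unsqueeze(1)   # (tgt_len, 1)
    src_pos = torch.linspace(0, 1, src_len).unsqueeze(0)   # (1, src_len)
    distance = (tgt_pos - src_pos).abs()                   # distance from the diagonal
    return (attn * distance).sum(dim=-1).mean()            # expected off-diagonal distance

attn = torch.softmax(torch.randn(4, 20, 22), dim=-1)
print(diagonal_attention_loss(attn))
```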
https://arxiv.org/abs/2409.00527
The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to these findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and in the text requiring correction.
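A hedged sketch of what prompting with socio-cultural context might look like is shown below; the wording and the context fields are illustrative, not the paper's actual prompts, and the call to the LM itself is omitted.

```python
# Illustrative prompt construction for LM-based post-OCR correction with
# socio-cultural context. The phrasing and context fields are assumptions;
# the paper's actual prompts and downstream LM call are not reproduced here.
def build_correction_prompt(ocr_text: str, context: dict) -> str:
    return (
        f"The following text was produced by OCR from {context['source']} "
        f"({context['period']}, {context['genre']}). "
        "Correct the OCR errors while preserving the original wording, "
        "period spelling conventions, and line breaks.\n\n"
        f"OCR text:\n{ocr_text}\n\nCorrected text:"
    )

prompt = build_correction_prompt(
    "Tbe Parliarnent met ou Tuesd:iy last,",
    {"source": "a 19th-century English periodical",
     "period": "1850s", "genre": "news reporting"},
)
print(prompt)
```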
https://arxiv.org/abs/2408.17428
The ability to accurately interpret complex visual information is a crucial capability for multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: this https URL
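The "simply concatenating visual tokens" finding can be sketched as channel-wise concatenation of per-token features from several encoders followed by a projection into the LLM embedding space; the encoders below are dummy stand-ins, not the vision experts used in Eagle.

```python
# Sketch of channel-wise fusion of visual tokens from multiple vision encoders,
# followed by projection into the language model's embedding space. The dummy
# encoders are stand-ins for the complementary vision experts used in Eagle.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedVisionProjector(nn.Module):
    def __init__(self, encoders: nn.ModuleList, enc_dims: list, llm_dim: int = 4096):
        super().__init__()
        self.encoders = encoders
        self.proj = nn.Linear(sum(enc_dims), llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = [enc(image) for enc in self.encoders]  # each: (B, N_tokens, D_i)
        fused = torch.cat(feats, dim=-1)               # (B, N_tokens, sum(D_i))
        return self.proj(fused)                        # (B, N_tokens, llm_dim)

class DummyEncoder(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(3, dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Stand-in: pool the image into 64 "patches" of 3 channels each.
        patches = F.adaptive_avg_pool2d(image, (8, 8)).flatten(2).transpose(1, 2)
        return self.proj(patches)                      # (B, 64, dim)

encoders = nn.ModuleList([DummyEncoder(768), DummyEncoder(1024)])
fuser = MixedVisionProjector(encoders, [768, 1024], llm_dim=4096)
print(fuser(torch.randn(2, 3, 224, 224)).shape)        # torch.Size([2, 64, 4096])
```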
https://arxiv.org/abs/2408.15998
Most production-level deployments for Visual Question Answering (VQA) tasks are still built as processing pipelines of independent steps, including image pre-processing, object and text detection, Optical Character Recognition (OCR), and (mostly supervised) object classification. However, the recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question of whether these custom-trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs in the context of VQA and OCR [5, 9, 12] tasks in a production-level scenario. Using data from the Retail-786k [10] dataset, we investigate the capabilities of pre-trained VLMs to answer detailed questions about advertised products in images. Our study includes two commercial models, GPT-4V [16] and GPT-4o [17], as well as four open-source models: InternVL [5], LLaVA 1.5 [12], LLaVA-NeXT [13], and CogAgent [9]. Our initial results show that, in general, there is no large performance gap between open-source and commercial models. However, we observe a strong task-dependent variance in VLM performance: while most models are able to answer questions regarding the product brand and price with high accuracy, they completely fail to correctly identify the specific product name or discount. This indicates that VLMs struggle with fine-grained classification tasks as well as with modeling the more abstract concept of discounts.
https://arxiv.org/abs/2408.15626
This research paper introduces an innovative word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text recognition. Utilizing transformer-based architectures and attention mechanisms, the model was trained on a comprehensive dataset of approximately 160,000 Urdu text images, achieving a character error rate (CER) of 0.178, which highlights its superior accuracy in recognizing Urdu characters. The model's strength lies in its unique architecture, incorporating the permuted autoregressive sequence (PARSeq) model, which allows for context-aware inference and iterative refinement by leveraging bidirectional context information to enhance recognition accuracy. Furthermore, its capability to handle a diverse range of Urdu text styles, fonts, and variations enhances its applicability in real-world scenarios. Despite its promising results, the model has some limitations, such as difficulty with blurred images, non-horizontal orientations, and overlays of patterns, lines, or other text, which can occasionally lead to suboptimal performance. Additionally, trailing or following punctuation marks can introduce noise into the recognition process. Addressing these challenges will be a focus of future research, aiming to refine the model further, explore data augmentation techniques, optimize hyperparameters, and integrate contextual improvements for more accurate and efficient Urdu text recognition.
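For reference, the character error rate (CER) reported above is the Levenshtein edit distance between prediction and reference, normalized by the reference length; a pure-Python sketch follows (libraries such as jiwer or editdistance would normally be used).

```python
# Character error rate (CER), the metric reported above (0.178): Levenshtein
# edit distance between hypothesis and reference, divided by reference length.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n] / max(m, 1)

print(cer("کتاب", "کتب"))  # one deletion over four reference characters -> 0.25
```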
https://arxiv.org/abs/2408.15119
Despite significant advancements in License Plate Recognition (LPR) through deep learning, most improvements rely on high-resolution images with clear characters. This scenario does not reflect real-world conditions where traffic surveillance often captures low-resolution and blurry images. Under these conditions, characters tend to blend with the background or neighboring characters, making accurate LPR challenging. To address this issue, we introduce a novel loss function, Layout and Character Oriented Focal Loss (LCOFL), which considers factors such as resolution, texture, and structural details, as well as the performance of the LPR task itself. We enhance character feature learning using deformable convolutions and shared weights in an attention module and employ a GAN-based training approach with an Optical Character Recognition (OCR) model as the discriminator to guide the super-resolution process. Our experimental results show significant improvements in character reconstruction quality, outperforming two state-of-the-art methods in both quantitative and qualitative measures. Our code is publicly available at this https URL
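A schematic sketch of the training signal described above: the generator is optimized with a pixel reconstruction term plus a recognition term from an OCR model acting as the discriminator, steering super-resolution toward readable characters. Both networks are placeholders, and the actual LCOFL formulation from the paper is not reproduced.

```python
# Schematic generator objective: pixel reconstruction plus a recognition loss
# from an OCR model used as the discriminator. Placeholder modules only; the
# paper's LCOFL and deformable-convolution components are not reproduced here.
import torch
import torch.nn.functional as F

def generator_loss(generator, ocr_model, lr_plates, hr_plates, char_labels, lambda_ocr=0.1):
    sr = generator(lr_plates)                   # (B, 3, H, W) super-resolved plates
    pixel_loss = F.l1_loss(sr, hr_plates)
    char_logits = ocr_model(sr)                 # (B, max_len, n_classes)
    ocr_loss = F.cross_entropy(char_logits.flatten(0, 1), char_labels.flatten())
    return pixel_loss + lambda_ocr * ocr_loss

# Dummy stand-ins to exercise the function shape-wise.
gen = torch.nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
ocr = lambda imgs: torch.randn(imgs.shape[0], 7, 36)    # 7 characters, 36 classes
lr = torch.randn(4, 3, 16, 48)
hr = torch.randn(4, 3, 32, 96)
labels = torch.randint(0, 36, (4, 7))
print(generator_loss(gen, ocr, lr, hr, labels))
```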
https://arxiv.org/abs/2408.15103
Optical Music Recognition (OMR) automates the transcription of musical notation from images into machine-readable formats like MusicXML, MEI, or MIDI, significantly reducing the costs and time of manual transcription. This study explores knowledge discovery in OMR by applying instance segmentation using Mask R-CNN to enhance the detection and delineation of musical symbols in sheet music. Unlike Optical Character Recognition (OCR), OMR must handle the intricate semantics of Common Western Music Notation (CWMN), where symbol meanings depend on shape, position, and context. Our approach leverages instance segmentation to manage the density and overlap of musical symbols, facilitating more precise information retrieval from music scores. Evaluations on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with our method reaching a mean Average Precision (mAP) of up to 59.70\% in dense symbol environments, comparable to object detection. Furthermore, using traditional computer vision techniques, we add a parallel step for staff detection to infer the pitch of the recognised symbols. This study emphasises the role of pixel-wise segmentation in advancing accurate music symbol recognition, contributing to knowledge discovery in OMR. Our findings indicate that instance segmentation provides more precise representations of musical symbols, particularly in densely populated scores, advancing OMR technology. We make our implementation, pre-processing scripts, trained models, and evaluation results publicly available to support further research and development.
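The staff-detection step can be illustrated by mapping a detected notehead's vertical position onto staff steps derived from the five staff-line coordinates; the sketch below assumes a treble clef and ignores ledger lines and accidentals.

```python
# Sketch of pitch inference from detected staff lines: a notehead's vertical
# centre is converted into half-spacing steps above the bottom staff line and
# mapped to a pitch name. Treble clef only, no ledger lines or accidentals --
# a deliberately simplified version of the parallel step described above.
TREBLE_STEPS = ["E4", "F4", "G4", "A4", "B4", "C5", "D5", "E5", "F5"]  # bottom line upward

def pitch_from_staff(notehead_y: float, staff_line_ys: list) -> str:
    lines = sorted(staff_line_ys)                        # smaller y = higher on the page
    spacing = (lines[-1] - lines[0]) / 4                 # distance between adjacent lines
    bottom = lines[-1]
    step = round((bottom - notehead_y) / (spacing / 2))  # half-spacing steps above bottom line
    step = max(0, min(step, len(TREBLE_STEPS) - 1))
    return TREBLE_STEPS[step]

staff = [100.0, 110.0, 120.0, 130.0, 140.0]              # y-coordinates of the five lines
print(pitch_from_staff(125.0, staff))                    # second space from the bottom -> "A4"
```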
https://arxiv.org/abs/2408.15002
The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speeds while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released in our Github.
https://arxiv.org/abs/2408.14998
Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on handwritten Syriac texts as a starting point for building more digital services for this endangered language. We created a dataset, KHAMIS (inspired by the East Syriac poet Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script collected from volunteers capable of reading and writing the language, and used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor; it will be partially available online, with the whole dataset released in the near future for development and research purposes. As a result, the handwritten OCR model achieved character error rates of 1.097-1.610% and 8.963-10.490% on the training and evaluation sets, respectively, and a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% on the test set, roughly twice as good as the default Syriac model of Tesseract.
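Running a fine-tuned Tesseract model from Python can be sketched as below; the language code "syr_khamis" is a hypothetical name for the fine-tuned traineddata file, while "syr" is Tesseract's stock Syriac model, and the image path is a placeholder.

```python
# Minimal sketch of comparing Tesseract's stock Syriac model against a
# fine-tuned handwritten model via pytesseract. "syr_khamis" is a hypothetical
# traineddata name that must be installed in Tesseract's tessdata directory;
# "handwritten_page.png" is a placeholder path.
from PIL import Image
import pytesseract

image = Image.open("handwritten_page.png")
baseline = pytesseract.image_to_string(image, lang="syr")
finetuned = pytesseract.image_to_string(image, lang="syr_khamis")
print(baseline)
print(finetuned)
```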
https://arxiv.org/abs/2408.13631
In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question-answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese vision question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: this https URL.
https://arxiv.org/abs/2408.12480
Page Stream Segmentation (PSS) is an essential prerequisite for automated document processing at scale. However, research progress has been limited by the absence of realistic public benchmarks. This paper works towards addressing this gap by introducing TABME++, an enhanced benchmark featuring commercial Optical Character Recognition (OCR) annotations. We evaluate the performance of large language models (LLMs) on PSS, focusing on decoder-based models fine-tuned with parameter-efficient methods. Our results show that decoder-based LLMs outperform smaller multimodal encoders. Through a review of existing PSS research and datasets, we identify key challenges and advancements in the field. Our findings highlight the key importance of robust OCR, providing valuable insights for the development of more effective document processing systems.
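A sketch of the parameter-efficient fine-tuning setup referenced above, using LoRA via the peft library on a generic decoder-only LM; the base model, target modules, and ranks are illustrative defaults rather than the paper's configuration.

```python
# Sketch of parameter-efficient fine-tuning of a decoder-only LM with LoRA via
# the peft library. Base model, target modules, and ranks are illustrative
# defaults, not the configuration used in the paper.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in decoder
config = LoraConfig(
    r=16,                       # low-rank update dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the LoRA adapters are trainable
```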
https://arxiv.org/abs/2408.11981
Recent developments in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have leveraged attention-based Transformer architectures and achieved superior performance and generalization capabilities. They have since covered extensive areas of traditional learning tasks. For instance, text-based tasks such as text classification and sequence labeling, as well as multi-modal tasks like Visual Question Answering (VQA) and Optical Character Recognition (OCR), which were previously addressed using different models, can now be tackled with one foundation model. Consequently, the training and lightweight fine-tuning of LLMs and MLLMs, especially those based on the Transformer architecture, has become particularly important. In recognition of these overwhelming needs, we develop SWIFT, a customizable one-stop infrastructure for large models. With support for $300+$ LLMs and $50+$ MLLMs, SWIFT stands as the open-source framework that provides the \textit{most comprehensive support} for fine-tuning large models. In particular, it is the first training framework that provides systematic support for MLLMs. In addition to the core functionality of fine-tuning, SWIFT also integrates post-training processes such as inference, evaluation, and model quantization, to facilitate the fast adoption of large models in various application scenarios. With a systematic integration of various training techniques, SWIFT offers helpful utilities such as benchmark comparisons among different training techniques for large models. For fine-tuning models specialized for agent frameworks, we show that notable improvements on the ToolBench leaderboard can be achieved by training with a customized dataset on SWIFT, with an increase of 5.2\%-21.8\% in the Act.EM metric over various baseline models, a reduction in hallucination by 1.6\%-14.1\%, and an average performance improvement of 8\%-17\%.
https://arxiv.org/abs/2408.05517
With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported. Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs. Project webpage: this https URL
https://arxiv.org/abs/2408.05334
Teaching Computer Science (CS) by having students write programs by hand on paper has key pedagogical advantages: It allows focused learning and requires careful thinking compared to the use of Integrated Development Environments (IDEs) with intelligent support tools or "just trying things out". The familiar environment of pens and paper also lessens the cognitive load of students with no prior experience with computers, for whom the mere basic usage of computers can be intimidating. Finally, this teaching approach opens learning opportunities to students with limited access to computers. However, a key obstacle is the current lack of teaching methods and support software for working with and running handwritten programs. Optical character recognition (OCR) of handwritten code is challenging: Minor OCR errors, perhaps due to varied handwriting styles, easily make code not run, and recognizing indentation is crucial for languages like Python but is difficult to do due to inconsistent horizontal spacing in handwriting. Our approach integrates two innovative methods. The first combines OCR with an indentation recognition module and a language model designed for post-OCR error correction without introducing hallucinations. This method, to our knowledge, surpasses all existing systems in handwritten code recognition. It reduces error from 30\% in the state of the art to 5\% with minimal hallucination of logical fixes to student programs. The second method leverages a multimodal language model to recognize handwritten programs in an end-to-end fashion. We hope this contribution can stimulate further pedagogical research and contribute to the goal of making CS education universally accessible. We release a dataset of handwritten programs and code to support future research at this https URL
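The indentation-recognition idea can be sketched by snapping each recognised line's leftmost x-coordinate to a discrete indentation level; the tolerance and the rendering as four spaces are illustrative choices, not the paper's actual module.

```python
# Sketch of indentation recognition for handwritten Python: the leftmost
# x-coordinate of each recognised line is snapped to a discrete indentation
# level relative to the leftmost line, tolerating the uneven horizontal
# spacing of handwriting. Tolerance and 4-space rendering are assumptions.
def infer_indentation(line_boxes, tolerance_ratio: float = 0.5) -> str:
    # line_boxes: list of (left_x, text) pairs in top-to-bottom order
    xs = [x for x, _ in line_boxes]
    margin = min(xs)
    # Estimate one indent unit as the smallest nonzero offset from the margin.
    offsets = sorted({x - margin for x in xs if x - margin > 0})
    unit = offsets[0] if offsets else 1.0
    indented = []
    for x, text in line_boxes:
        level = round((x - margin) / unit) if (x - margin) > tolerance_ratio * unit else 0
        indented.append("    " * level + text)
    return "\n".join(indented)

boxes = [(12, "def mean(xs):"), (41, "total = sum(xs)"), (43, "return total / len(xs)")]
print(infer_indentation(boxes))
```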
https://arxiv.org/abs/2408.07220
Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. Different from them, in this work we propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text via shifting the attention to the text area and probing the hidden text knowledge, and then divides the query text into content word and function word for processing, in which a semantic-aware prompting scheme and a distracted queries assistance module are utilized. Extensive experiments show that FDP significantly enhances the inference speed while achieving better or competitive retrieval accuracy compared to existing methods. Notably, on the IIIT-STR benchmark, FDP surpasses the state-of-the-art model by 4.37% with a 4 times faster speed. Furthermore, additional experiments under phrase-level and attribute-aware scene text retrieval settings validate FDP's particular advantages in handling diverse forms of query text. The source code will be publicly available at this https URL.
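For context, the OCR-free retrieval setting itself (before FDP's focus/distinguish/prompt components) can be sketched with plain CLIP: encode the query word and every gallery image and rank by cosine similarity; the image paths below are placeholders.

```python
# Baseline sketch of OCR-free scene-text retrieval with plain CLIP: encode the
# query text and every gallery image, then rank images by cosine similarity.
# This illustrates the setting FDP builds on, not FDP itself; paths are
# placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

gallery = [Image.open(p) for p in ["img1.jpg", "img2.jpg", "img3.jpg"]]

with torch.no_grad():
    img_inputs = processor(images=gallery, return_tensors="pt")
    img_feats = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=["coffee"], return_tensors="pt", padding=True)
    txt_feats = model.get_text_features(**txt_inputs)

img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
scores = (txt_feats @ img_feats.T).squeeze(0)   # cosine similarity per image
print(scores.argsort(descending=True).tolist()) # gallery indices, best match first
```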
https://arxiv.org/abs/2408.00441
Images are a powerful and immediate vehicle to carry misleading or outright false messages, yet identifying image-based misinformation at scale poses unique challenges. In this paper, we present PIXELMOD, a system that leverages perceptual hashes, vector databases, and optical character recognition (OCR) to efficiently identify images that are candidates to receive soft moderation labels on Twitter. We show that PIXELMOD outperforms existing image similarity approaches when applied to soft moderation, with negligible performance overhead. We then test PIXELMOD on a dataset of tweets surrounding the 2020 US Presidential Election, and find that it is able to identify visually misleading images that are candidates for soft moderation with 0.99% false detection and 2.06% false negatives.
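The perceptual-hash matching step can be sketched with the imagehash library: flag an image whose pHash lies within a Hamming-distance threshold of a known misleading image. The threshold and file names are illustrative; PIXELMOD additionally layers a vector database and OCR on top of this step.

```python
# Sketch of the perceptual-hash matching step: compute a pHash for a candidate
# image and compare it (Hamming distance) against hashes of known misleading
# images. Threshold and file names are illustrative placeholders.
from PIL import Image
import imagehash

known_hashes = [imagehash.phash(Image.open(p)) for p in ["flagged_1.png", "flagged_2.png"]]

def is_candidate(image_path: str, threshold: int = 8) -> bool:
    h = imagehash.phash(Image.open(image_path))
    return any(h - known <= threshold for known in known_hashes)  # '-' gives Hamming distance

print(is_candidate("tweet_image.png"))
```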
https://arxiv.org/abs/2407.20987