In this paper, we create benchmarks and assess the effectiveness of error correction methods for Japanese vouchers in OCR (Optical Character Recognition) systems. Correctly recognizing scanned voucher text, such as the company name on invoices, is essential for automated processing. However, perfect recognition is difficult due to noise such as stamps, so it is crucial to rectify erroneous OCR results. Yet no publicly available OCR error correction benchmarks for Japanese exist, and correction methods have not been adequately researched. In this study, we measured the text recognition accuracy of existing services on Japanese vouchers and developed a post-OCR correction benchmark. We then proposed simple baselines for error correction using language models and verified whether they could effectively correct these errors. In the experiments, the proposed error correction algorithm significantly improved overall recognition accuracy.
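As a minimal illustration of post-OCR correction (not the paper's baseline), a noisy OCR field can be snapped to the closest entry of a known lexicon by string similarity; the company names and the similarity cutoff below are hypothetical.

```python
# Minimal post-OCR correction sketch (illustrative only, not the paper's method):
# snap a noisy OCR field to the closest entry of a known lexicon by edit similarity.
import difflib

KNOWN_COMPANIES = ["株式会社山田商事", "有限会社田中印刷", "佐藤物産株式会社"]  # hypothetical lexicon

def correct_field(ocr_text: str, lexicon=KNOWN_COMPANIES, cutoff: float = 0.6) -> str:
    """Return the closest lexicon entry if it is similar enough, else the raw OCR text."""
    matches = difflib.get_close_matches(ocr_text, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_text

if __name__ == "__main__":
    noisy = "株式会社山m商事"  # OCR confused a character, e.g. where a stamp overlapped the text
    print(correct_field(noisy))  # -> 株式会社山田商事
```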
https://arxiv.org/abs/2409.19948
In the digital era, the ability to understand visually rich documents that integrate text, complex layouts, and imagery is critical. Traditional Key Information Extraction (KIE) methods primarily rely on Optical Character Recognition (OCR), which often introduces significant latency, computational overhead, and errors. Current advanced image-to-text approaches, which bypass OCR, typically yield plain text outputs without corresponding vision grounding. In this paper, we introduce STNet (See then Tell Net), a novel end-to-end model designed to deliver precise answers with relevant vision grounding. Distinctively, STNet utilizes a unique <see> token to observe pertinent image areas, aided by a decoder that interprets physical coordinates linked to this token. Positioned at the outset of the answer text, the <see> token allows the model to first see--observing the regions of the image related to the input question--and then tell--providing articulated textual responses. To enhance the model's seeing capabilities, we collect extensive structured table recognition datasets. Leveraging the advanced text processing prowess of GPT-4, we develop the TVG (TableQA with Vision Grounding) dataset, which not only provides text-based Question Answering (QA) pairs but also incorporates precise vision grounding for these pairs. Our approach demonstrates substantial advancements in KIE performance, achieving state-of-the-art results on publicly available datasets such as CORD, SROIE, and DocVQA. The code will also be made publicly available.
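A conceptual sketch of the grounding-token idea follows (not STNet's actual architecture): the hidden state at the <see> position feeds a small coordinate head, while subsequent positions are decoded as answer text. The module names, dimensions, and random decoder states are illustrative assumptions.

```python
# Conceptual sketch of a grounding token: the hidden state at a special <see>
# position is decoded into box coordinates, and the remaining positions are
# decoded into answer-text logits as usual.
import torch
import torch.nn as nn

class SeeThenTellHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.text_head = nn.Linear(hidden_dim, vocab_size)   # predicts answer tokens
        self.coord_head = nn.Sequential(                     # predicts a normalized box (x1, y1, x2, y2)
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 4), nn.Sigmoid()
        )

    def forward(self, hidden_states: torch.Tensor, see_index: int):
        box = self.coord_head(hidden_states[:, see_index])         # grounding from the <see> position
        logits = self.text_head(hidden_states[:, see_index + 1:])  # text logits for later positions
        return box, logits

# toy usage with random states standing in for a real transformer decoder
head = SeeThenTellHead(hidden_dim=256, vocab_size=32000)
states = torch.randn(1, 16, 256)
box, logits = head(states, see_index=0)
print(box.shape, logits.shape)  # torch.Size([1, 4]) torch.Size([1, 15, 32000])
```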
https://arxiv.org/abs/2409.19573
Programming tutorials in the form of coding screencasts play a crucial role in programming education, serving both novices and experienced developers. However, the video format of these tutorials presents a challenge due to the difficulty of searching for and within videos. Addressing the absence of large-scale and diverse datasets for screencast analysis, we introduce the CodeSCAN dataset. It comprises 12,000 screenshots captured from the Visual Studio Code environment during development, featuring 24 programming languages, 25 fonts, and over 90 distinct themes, in addition to diverse layout changes and realistic user interactions. Moreover, we conduct detailed quantitative and qualitative evaluations to benchmark the performance of Integrated Development Environment (IDE) element detection, color-to-black-and-white conversion, and Optical Character Recognition (OCR). We hope that our contributions facilitate more research in coding screencast analysis, and we make the source code for creating the dataset and the benchmark publicly available on this website.
https://arxiv.org/abs/2409.18556
This paper presents a benchmark dataset for aligning lecture videos with corresponding slides and introduces a novel multimodal algorithm leveraging features from speech, text, and images. It achieves an average accuracy of 0.82, compared to 0.56 for SIFT, while being approximately 11 times faster. Using dynamic programming, the algorithm determines the optimal slide sequence, and the results show that penalizing slide transitions increases accuracy. Features obtained via optical character recognition (OCR) contribute the most to high matching accuracy, followed by image features. The findings highlight that audio transcripts alone provide valuable information for alignment and are beneficial when OCR data is lacking. Variations in matching accuracy across lectures highlight the challenges posed by video quality and lecture style. The novel multimodal algorithm demonstrates robustness to some of these challenges, underscoring the potential of the approach.
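The dynamic-programming step can be illustrated as follows; the scoring scheme, the transition penalty, and the toy similarity matrix are assumed simplifications, not the paper's implementation.

```python
# Sketch of slide-sequence alignment by dynamic programming: given a
# segment-by-slide similarity matrix, pick one slide per video segment while
# penalizing each slide transition so the sequence does not jump erratically.
import numpy as np

def align_slides(similarity: np.ndarray, transition_penalty: float = 0.1) -> list[int]:
    """similarity[t, s] = score of segment t matching slide s; returns one slide index per segment."""
    n_segments, n_slides = similarity.shape
    score = np.full((n_segments, n_slides), -np.inf)
    back = np.zeros((n_segments, n_slides), dtype=int)
    score[0] = similarity[0]
    for t in range(1, n_segments):
        for s in range(n_slides):
            prev = score[t - 1] - transition_penalty  # switching slides costs a penalty
            prev[s] = score[t - 1, s]                 # staying on the same slide is free
            back[t, s] = int(np.argmax(prev))
            score[t, s] = prev[back[t, s]] + similarity[t, s]
    path = [int(np.argmax(score[-1]))]                # backtrack the best path
    for t in range(n_segments - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

sim = np.array([[0.9, 0.1], [0.4, 0.5], [0.2, 0.8]])
print(align_slides(sim))  # -> [0, 1, 1]
```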
https://arxiv.org/abs/2409.16765
As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@Bench). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. Our framework exhibits outstanding performance across tasks by integrating multi-modal information, offering PVIs more comprehensive assistance. Extensive experiments prove the effectiveness and generalizability of our framework.
https://arxiv.org/abs/2409.14215
Currently, a substantial volume of document data exists in an unstructured format, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, Plumb a PDF (pdfplumber), and Paddle Paddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can only extract tables from digital PDFs and cannot handle image-based PDFs and pictures. PP-StructureV2, on the other hand, can extract tables from image-based PDFs and pictures, but it lacks the ability to differentiate between diverse application scenarios, such as wired and wireless tables or digital and image-based PDFs. To address these issues, we have introduced the PDF table extraction (PdfTable) toolkit. This toolkit integrates numerous open-source models, including seven table recognition models, four Optical Character Recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable achieves adaptability across various application scenarios. We substantiate the efficacy of the PdfTable toolkit through verification on a self-labeled wired table dataset and the open-source wireless Public Table Recognition Dataset (PubTabNet). The PdfTable code will be available on GitHub: this https URL.
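As a simplified illustration of scenario-dependent routing (not PdfTable's actual dispatch logic), the sketch below checks whether a page has an extractable text layer and chooses between a digital-PDF pipeline and an image/OCR pipeline; the threshold and pipeline names are arbitrary placeholders.

```python
# Route a PDF page to a digital-text or OCR pipeline based on whether a usable
# text layer exists (simplified heuristic, illustrative only).
import pdfplumber

def route_page(pdf_path: str, page_index: int = 0) -> str:
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        text = page.extract_text() or ""
    if len(text.strip()) > 20:           # heuristic threshold for a usable text layer
        return "digital-pdf pipeline"    # e.g. rule-based table extraction
    return "image-based pipeline"        # e.g. layout analysis + OCR + table recognition

# print(route_page("invoice.pdf"))       # hypothetical input file
```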
https://arxiv.org/abs/2409.05125
In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
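As a rough illustration of the weight-averaging ("model soup") step mentioned above (the paper's exact recipe is not reproduced here), the sketch below uniformly averages the parameters of several fine-tuned checkpoints sharing one architecture; the checkpoint paths are hypothetical.

```python
# Minimal "model soup" sketch: uniformly average the weights of several
# fine-tuned checkpoints of the same architecture instead of adding ever more
# instruction-tuning datasets.
import torch

def uniform_soup(checkpoint_paths: list[str]) -> dict:
    """Average the state dicts of checkpoints that share an identical architecture."""
    soup = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(checkpoint_paths) for k, v in soup.items()}

# hypothetical checkpoints fine-tuned on different instruction mixes
# averaged = uniform_soup(["ft_mix_a.pt", "ft_mix_b.pt", "ft_mix_c.pt"])
# model.load_state_dict(averaged)  # 'model' must use the same architecture
```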
https://arxiv.org/abs/2409.04828
Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develop ConfBERT, a BERT-based model that incorporates OCR confidence scores into token embeddings and offers an optional pre-training phase for noise adjustment. Our experimental results demonstrate that integrating OCR confidence scores can enhance error detection capabilities. This work underscores the importance of OCR confidence scores in improving detection accuracy and reveals substantial disparities in performance between commercial and open-source OCR technologies.
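A minimal sketch of how OCR confidence scores might be folded into token embeddings in the spirit of ConfBERT; the exact architecture is an assumption, shown here as a learned projection of each token's scalar confidence added to its embedding.

```python
# Confidence-aware token embeddings: each token embedding is augmented with a
# learned projection of its scalar OCR confidence score.
import torch
import torch.nn as nn

class ConfidenceAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_dim)
        self.conf_projection = nn.Linear(1, hidden_dim)  # maps a scalar confidence to hidden_dim

    def forward(self, token_ids: torch.Tensor, confidences: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); confidences: (batch, seq_len), values in [0, 1]
        return self.token_embedding(token_ids) + self.conf_projection(confidences.unsqueeze(-1))

emb = ConfidenceAwareEmbedding(vocab_size=30522, hidden_dim=768)
ids = torch.randint(0, 30522, (2, 8))
conf = torch.rand(2, 8)
print(emb(ids, conf).shape)  # torch.Size([2, 8, 768])
```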
https://arxiv.org/abs/2409.04117
Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's needs due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model consisting of a high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT can handle all the above "characters" across various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page formats. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via a simple prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic-resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.
https://arxiv.org/abs/2409.01704
The digitization of historical documents is crucial for preserving the cultural heritage of society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which enables further search, information extraction, etc. Unfortunately, this is a hard problem, as standard OCR tools are not tailored to deal with historical orthography or with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and copy and coverage mechanisms, to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state of the art on the ICDAR 2019 Bulgarian dataset. We release our data and code at \url{this https URL}.
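One plausible form of a diagonal attention loss is sketched below, motivated by the fact that OCR correction is a nearly monotonic sequence-to-sequence task; the exact formulation used in the paper may differ, so treat this as an assumption.

```python
# Penalize attention mass that strays from the diagonal of the target-by-source
# attention matrix (source and target are nearly monotonically aligned in
# post-OCR correction).
import torch

def diagonal_attention_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, tgt_len, src_len) attention weights that sum to 1 over src_len."""
    batch, tgt_len, src_len = attn.shape
    tgt_pos = torch.linspace(0, 1, tgt_len).view(1, tgt_len, 1)
    src_pos = torch.linspace(0, 1, src_len).view(1, 1, src_len)
    distance = (tgt_pos - src_pos).abs()          # distance of each cell from the diagonal
    return (attn * distance).sum(dim=-1).mean()   # expected off-diagonal distance

attn = torch.softmax(torch.randn(4, 10, 12), dim=-1)
print(diagonal_attention_loss(attn))  # scalar term to add to the cross-entropy objective
```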
https://arxiv.org/abs/2409.00527
The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. In addition to these findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
https://arxiv.org/abs/2408.17428
The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: this https URL
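A minimal sketch of the "simple concatenation" fusion described above follows; the concatenation axis, the shared token count, and the projection layer are assumptions for illustration, with random tensors standing in for real encoder outputs.

```python
# Fuse complementary vision encoders by concatenating their token features along
# the channel dimension, then projecting to the language model's hidden width.
import torch
import torch.nn as nn

class ConcatVisionFusion(nn.Module):
    def __init__(self, encoder_dims: list[int], llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(sum(encoder_dims), llm_dim)  # project fused channels to the LLM width

    def forward(self, encoder_tokens: list[torch.Tensor]) -> torch.Tensor:
        # each element: (batch, num_tokens, dim_i); token counts must already match
        fused = torch.cat(encoder_tokens, dim=-1)          # concatenate along the channel dimension
        return self.proj(fused)

fusion = ConcatVisionFusion(encoder_dims=[1024, 768], llm_dim=4096)
clip_like = torch.randn(1, 576, 1024)    # stand-in for one encoder's output
detr_like = torch.randn(1, 576, 768)     # stand-in for a complementary encoder's output
print(fusion([clip_like, detr_like]).shape)  # torch.Size([1, 576, 4096])
```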
https://arxiv.org/abs/2408.15998
Most production-level deployments for Visual Question Answering (VQA) tasks are still built as processing pipelines of independent steps, including image pre-processing, object and text detection, Optical Character Recognition (OCR), and (mostly supervised) object classification. However, the recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question of whether these custom-trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs in the context of VQA and OCR [5, 9, 12] tasks in a production-level scenario. Using data from the Retail-786k [10] dataset, we investigate the capabilities of pre-trained VLMs to answer detailed questions about advertised products in images. Our study includes two commercial models, GPT-4V [16] and GPT-4o [17], as well as four open-source models: InternVL [5], LLaVA 1.5 [12], LLaVA-NeXT [13], and CogAgent [9]. Our initial results show that, in general, there is no large performance gap between open-source and commercial models. However, we observe a strong task-dependent variance in VLM performance: while most models can answer questions about the product brand and price with high accuracy, they completely fail to correctly identify the specific product name or discount. This indicates that VLMs struggle both with fine-grained classification tasks and with modeling the more abstract concept of discounts.
https://arxiv.org/abs/2408.15626
This research paper introduces an innovative word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text recognition. Utilizing transformer-based architectures and attention mechanisms, the model was trained on a comprehensive dataset of approximately 160,000 Urdu text images, achieving a character error rate (CER) of 0.178, which highlights its superior accuracy in recognizing Urdu characters. The model's strength lies in its unique architecture, incorporating the permuted autoregressive sequence (PARSeq) model, which allows for context-aware inference and iterative refinement by leveraging bidirectional context information to enhance recognition accuracy. Furthermore, its capability to handle a diverse range of Urdu text styles, fonts, and variations enhances its applicability in real-world scenarios. Despite its promising results, the model has some limitations, such as difficulty with blurred images, non-horizontal orientations, and overlays of patterns, lines, or other text, which can occasionally lead to suboptimal performance. Additionally, trailing or following punctuation marks can introduce noise into the recognition process. Addressing these challenges will be a focus of future research, aiming to refine the model further, explore data augmentation techniques, optimize hyperparameters, and integrate contextual improvements for more accurate and efficient Urdu text recognition.
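The reported character error rate is assumed to follow the standard definition of character-level edit distance divided by reference length; a minimal implementation for context:

```python
# Character error rate (CER) = Levenshtein edits / reference length.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (0 cost if equal)
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    return levenshtein(prediction, reference) / max(len(reference), 1)

print(cer("اردو متن", "اردو متن"))   # 0.0
print(cer("اردو متں", "اردو متن"))   # 0.125 (one substituted character out of eight)
```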
https://arxiv.org/abs/2408.15119
Despite significant advancements in License Plate Recognition (LPR) through deep learning, most improvements rely on high-resolution images with clear characters. This scenario does not reflect real-world conditions where traffic surveillance often captures low-resolution and blurry images. Under these conditions, characters tend to blend with the background or neighboring characters, making accurate LPR challenging. To address this issue, we introduce a novel loss function, Layout and Character Oriented Focal Loss (LCOFL), which considers factors such as resolution, texture, and structural details, as well as the performance of the LPR task itself. We enhance character feature learning using deformable convolutions and shared weights in an attention module and employ a GAN-based training approach with an Optical Character Recognition (OCR) model as the discriminator to guide the super-resolution process. Our experimental results show significant improvements in character reconstruction quality, outperforming two state-of-the-art methods in both quantitative and qualitative measures. Our code is publicly available at this https URL
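A toy sketch of recognition-guided super-resolution training in the spirit of the description above: both networks are placeholder stand-ins, the single-character label setup and loss weighting are arbitrary, and LCOFL itself is not reproduced.

```python
# Update a super-resolution generator with a pixel loss plus a loss from a
# recognition head that should read the correct content off the SR output.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))                            # stand-in SR generator
ocr_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 36))   # stand-in recognizer
opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

low_res = torch.randn(2, 3, 32, 96)       # blurry plate crops (already resized to target shape here)
high_res = torch.randn(2, 3, 32, 96)      # ground-truth sharp crops
char_labels = torch.randint(0, 36, (2,))  # toy single-character labels for the stand-in recognizer

sr = generator(low_res)
pixel_loss = nn.functional.l1_loss(sr, high_res)
ocr_loss = nn.functional.cross_entropy(ocr_head(sr), char_labels)  # recognition-guided term
loss = pixel_loss + 0.5 * ocr_loss        # weighting is an arbitrary choice for illustration
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```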
https://arxiv.org/abs/2408.15103
Optical Music Recognition (OMR) automates the transcription of musical notation from images into machine-readable formats like MusicXML, MEI, or MIDI, significantly reducing the costs and time of manual transcription. This study explores knowledge discovery in OMR by applying instance segmentation using Mask R-CNN to enhance the detection and delineation of musical symbols in sheet music. Unlike Optical Character Recognition (OCR), OMR must handle the intricate semantics of Common Western Music Notation (CWMN), where symbol meanings depend on shape, position, and context. Our approach leverages instance segmentation to manage the density and overlap of musical symbols, facilitating more precise information retrieval from music scores. Evaluations on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with our method achieving a mean Average Precision (mAP) of up to 59.70% in dense symbol environments, comparable to object detection. Furthermore, using traditional computer vision techniques, we add a parallel staff-detection step to infer the pitch of the recognised symbols. This study emphasises the role of pixel-wise segmentation in advancing accurate music symbol recognition and contributes to knowledge discovery in OMR. Our findings indicate that instance segmentation provides more precise representations of musical symbols, particularly in densely populated scores, advancing OMR technology. We make our implementation, pre-processing scripts, trained models, and evaluation results publicly available to support further research and development.
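The staff-based pitch-inference step can be illustrated with a small geometric calculation; the sketch below is a simplified assumption of how detected staff-line positions map a notehead to a diatonic step, ignoring clefs, accidentals, and ledger-line edge cases.

```python
# Convert a notehead's vertical position into a step count relative to the
# detected staff lines (lines and spaces alternate every half staff spacing).
def staff_step(notehead_y: float, staff_line_ys: list[float]) -> int:
    """Return diatonic steps above the bottom staff line (negative = below it)."""
    staff_line_ys = sorted(staff_line_ys)                    # smallest y first = top line on the page
    spacing = (staff_line_ys[-1] - staff_line_ys[0]) / (len(staff_line_ys) - 1)
    half_step = spacing / 2
    bottom_line_y = staff_line_ys[-1]
    return round((bottom_line_y - notehead_y) / half_step)

lines = [100.0, 110.0, 120.0, 130.0, 140.0]                  # detected y-coordinates of a 5-line staff
print(staff_step(135.0, lines))  # 1 -> the space just above the bottom line (e.g. F4 in treble clef)
```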
https://arxiv.org/abs/2408.15002
The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speeds while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released in our Github.
https://arxiv.org/abs/2408.14998
Many languages have vast amounts of handwritten texts, such as ancient scripts about folktale stories and historical narratives or contemporary documents and letters. Digitization of those texts has various applications, such as daily tasks, cultural studies, and historical research. Syriac is an ancient, endangered, and low-resourced language that has not received the attention it requires and deserves. This paper reports on a research project aimed at developing an optical character recognition (OCR) model based on handwritten Syriac texts as a starting point for building more digital services for this endangered language. We created a dataset, KHAMIS (inspired by the East Syriac poet Khamis bar Qardahe), which consists of handwritten sentences in the East Syriac script collected from volunteers capable of reading and writing the language, and used it to fine-tune the Tesseract-OCR engine's pretrained Syriac model on handwritten data. KHAMIS currently consists of 624 handwritten Syriac sentences collected from 31 university students and one professor; it will be partially available online, and the whole dataset will be released in the near future for development and research purposes. As a result, the handwritten OCR model achieved character error rates of 1.097-1.610% and 8.963-10.490% on the training and evaluation sets, respectively, and a character error rate of 18.89-19.71% and a word error rate of 62.83-65.42% on the test set, which is roughly twice as good as the default Syriac model of Tesseract.
https://arxiv.org/abs/2408.13631
In this report, we introduce Vintern-1B, a reliable 1-billion-parameters multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question-answering in Vietnamese context. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks like OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit into various on-device applications easily. Additionally, we have open-sourced several Vietnamese vision question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: this https URL.
https://arxiv.org/abs/2408.12480
Page Stream Segmentation (PSS) is an essential prerequisite for automated document processing at scale. However, research progress has been limited by the absence of realistic public benchmarks. This paper works towards addressing this gap by introducing TABME++, an enhanced benchmark featuring commercial Optical Character Recognition (OCR) annotations. We evaluate the performance of large language models (LLMs) on PSS, focusing on decoder-based models fine-tuned with parameter-efficient methods. Our results show that decoder-based LLMs outperform smaller multimodal encoders. Through a review of existing PSS research and datasets, we identify key challenges and advancements in the field. Our findings highlight the key importance of robust OCR, providing valuable insights for the development of more effective document processing systems.
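As an illustrative sketch (not the paper's setup), PSS can be cast as a text task for a decoder LLM and fine-tuned with parameter-efficient adapters; below, gpt2 merely stands in for a larger decoder model, and the prompt format, label scheme, and LoRA hyperparameters are assumptions.

```python
# Frame page stream segmentation as: given OCR text of two consecutive pages,
# predict whether the current page starts a new document, and train only
# low-rank (LoRA) adapters on a causal LM.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def pss_example(prev_page_ocr: str, curr_page_ocr: str, label: str) -> str:
    """label: 'SAME' if the two pages belong to one document, 'NEW' if a new document starts."""
    return (f"Previous page:\n{prev_page_ocr}\n\nCurrent page:\n{curr_page_ocr}\n\n"
            f"Does the current page start a new document? Answer: {label}")

model = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in for a larger decoder LLM
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)                    # only the adapter weights are trainable
model.print_trainable_parameters()
# The wrapped model can then be fine-tuned on pss_example(...) strings with a
# standard causal-LM objective, supervising the final answer token.
```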
https://arxiv.org/abs/2408.11981