Video-based Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate text information from video captures. Traditional systems typically rely heavily on high-end computing resources and utilize multiple frames to recognize license plates, leading to increased computational overhead. In this paper, we propose two methods capable of efficiently extracting exactly one frame per vehicle and recognizing its license plate characters from this single image, thus significantly reducing computational demands. The first method uses Visual Rhythm (VR) to generate time-spatial images from videos, while the second employs Accumulative Line Analysis (ALA), a novel algorithm based on single-line video processing for real-time operation. Both methods leverage YOLO for license plate detection within the frame and a Convolutional Neural Network (CNN) for Optical Character Recognition (OCR) to extract textual information. Experiments on real videos demonstrate that the proposed methods achieve results comparable to traditional frame-by-frame approaches, with processing speeds three times faster.
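As a rough illustration of the single-line idea behind ALA, the sketch below (not the authors' code) reads only one pixel row per frame and keeps exactly one frame per vehicle, triggered when activity on that row starts and then ends; the row position, threshold, and the recognize_plate() stub standing in for the YOLO + CNN OCR stage are all illustrative assumptions.

```python
import cv2
import numpy as np

def recognize_plate(frame):
    """Placeholder for the paper's YOLO plate detector + CNN OCR stage."""
    return "<plate text>"

def single_frame_per_vehicle(video_path, line_y=400, diff_thresh=25.0):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return []
    background_line = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)[line_y].astype(np.float32)
    plates, vehicle_present = [], False
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        line = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)[line_y].astype(np.float32)
        activity = np.mean(np.abs(line - background_line))  # how much the line changed
        if activity > diff_thresh and not vehicle_present:
            vehicle_present = True                  # a vehicle started crossing the line
            plates.append(recognize_plate(frame))   # keep exactly one frame for it
        elif activity <= diff_thresh and vehicle_present:
            vehicle_present = False                 # vehicle has passed; re-arm the trigger
        background_line = 0.95 * background_line + 0.05 * line  # slowly adapt the background
    cap.release()
    return plates
```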
https://arxiv.org/abs/2501.04750
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: this https URL
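A simplified sketch of the grid-building and remapping step described above, assuming the deep-learning stage has already produced point lists for the top and bottom document borders (left and right borders are treated as straight here, and the output size is arbitrary); it illustrates the cubic-interpolation-plus-remap idea rather than reproducing the paper's pipeline.

```python
import cv2
import numpy as np

def dewarp(img, top_pts, bot_pts, out_w=1000, out_h=1400):
    """top_pts/bot_pts: arrays of (x, y) points along the detected document borders."""
    # Fit cubic polynomials y = f(x) to the detected top and bottom borders.
    top_poly = np.poly1d(np.polyfit(top_pts[:, 0], top_pts[:, 1], deg=3))
    bot_poly = np.poly1d(np.polyfit(bot_pts[:, 0], bot_pts[:, 1], deg=3))

    x_left = min(top_pts[:, 0].min(), bot_pts[:, 0].min())
    x_right = max(top_pts[:, 0].max(), bot_pts[:, 0].max())

    # For every output pixel, compute where it comes from in the source image.
    u = np.linspace(0.0, 1.0, out_w)                 # horizontal fraction
    v = np.linspace(0.0, 1.0, out_h)[:, None]        # vertical fraction
    src_x = x_left + u * (x_right - x_left)          # shape (out_w,)
    y_top, y_bot = top_poly(src_x), bot_poly(src_x)  # curved borders evaluated per column
    map_x = np.broadcast_to(src_x, (out_h, out_w)).astype(np.float32)
    map_y = (y_top + v * (y_bot - y_top)).astype(np.float32)

    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
```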
https://arxiv.org/abs/2501.03145
Generating visual text in natural scene images is a challenging task with many unsolved problems. Different from generating text on artificially designed images (such as posters, covers, cartoons, etc.), the text in natural scene images needs to meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, walls, etc.), and the generated text content should also be relevant to the scene. (3) Utility: the generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: the attributes of the text (such as font and color) should be controllable as needed. In this paper, we propose a two-stage method, SceneVTG++, which simultaneously satisfies the four aspects mentioned above. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former utilizes the world knowledge of multimodal large language models to find reasonable text areas and recommend text content according to the natural scene background images, while the latter generates controllable multilingual text based on the diffusion model. Through extensive experiments, we respectively verified the effectiveness of TLCG and CLTD, and demonstrated the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks like text detection and text recognition. Codes and datasets will be available.
https://arxiv.org/abs/2501.02962
Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate information from an image or a video capture. These systems have gained popularity due to the wide availability of low-cost surveillance cameras and advances in Deep Learning. Typically, video-based ALPR systems rely on multiple frames to detect the vehicle and recognize the license plates. Therefore, we propose a system capable of extracting exactly one frame per vehicle and recognizing its license plate characters from this single image using an Optical Character Recognition (OCR) model. Early experiments show that this methodology is viable.
https://arxiv.org/abs/2501.02270
Super-resolution (SR) techniques play a pivotal role in enhancing the quality of low-resolution images, particularly for applications such as security and surveillance, where accurate license plate recognition is crucial. This study proposes a novel framework that combines pixel-based loss with embedding similarity learning to address the unique challenges of license plate super-resolution (LPSR). The introduced pixel and embedding consistency loss (PECL) integrates a Siamese network and applies contrastive loss to force embedding similarities to improve perceptual and structural fidelity. By effectively balancing pixel-wise accuracy with embedding-level consistency, the framework achieves superior alignment of fine-grained features between high-resolution (HR) and super-resolved (SR) license plates. Extensive experiments on the CCPD dataset validate the efficacy of the proposed framework, demonstrating consistent improvements over state-of-the-art methods in terms of PSNR_RGB, PSNR_Y and optical character recognition (OCR) accuracy. These results highlight the potential of embedding similarity learning to advance both perceptual quality and task-specific performance in extreme super-resolution scenarios.
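A hedged PyTorch sketch of a combined pixel and embedding consistency objective in the spirit of PECL; the specific losses, margin, weighting, and Siamese encoder below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelEmbeddingLoss(nn.Module):
    def __init__(self, encoder: nn.Module, margin: float = 1.0, lam: float = 0.1):
        super().__init__()
        self.encoder = encoder        # shared (Siamese) embedding network
        self.margin = margin
        self.lam = lam

    def forward(self, sr, hr, negative=None):
        pixel_loss = F.l1_loss(sr, hr)                        # pixel-wise fidelity
        z_sr = F.normalize(self.encoder(sr), dim=1)
        z_hr = F.normalize(self.encoder(hr), dim=1)
        emb_loss = (z_sr - z_hr).pow(2).sum(dim=1).mean()     # pull SR toward its HR counterpart
        if negative is not None:                              # push away mismatched plates
            z_neg = F.normalize(self.encoder(negative), dim=1)
            neg = (z_sr - z_neg).pow(2).sum(dim=1)
            emb_loss = emb_loss + F.relu(self.margin - neg).mean()
        return pixel_loss + self.lam * emb_loss
```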
https://arxiv.org/abs/2501.01483
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at this https URL.
https://arxiv.org/abs/2501.00321
With the rise of multimodal large language models, accurately extracting and understanding textual information from video content, referred to as video-based optical character recognition (Video OCR), has become a crucial capability. This paper introduces a novel benchmark designed to evaluate the video OCR performance of multimodal models in videos. Comprising 1,028 videos and 2,961 question-answer pairs, this benchmark proposes several key challenges through 6 distinct subtasks: (1) recognition of text content itself and its basic visual attributes, (2) semantic and spatial comprehension of OCR objects in videos, and (3) dynamic motion detection and temporal localization. We developed this benchmark using a semi-automated approach that integrates the OCR ability of image LLMs with manual refinement, balancing efficiency, cost, and data quality. Our resource aims to help advance research in video LLMs and underscores the need for improving OCR ability for video LLMs. The benchmark will be released on this https URL.
https://arxiv.org/abs/2412.20613
This research paper delves into the development of an Optical Character Recognition (OCR) system for the recognition of Ashokan Brahmi characters using Convolutional Neural Networks. It utilizes a comprehensive dataset of character images to train the models, along with data augmentation techniques to optimize the training process. Furthermore, the paper incorporates image preprocessing to remove noise, as well as image segmentation to facilitate line and character segmentation. The study mainly focuses on three pre-trained CNNs, namely LeNet, VGG-16, and MobileNet, and compares their accuracy. Transfer learning was employed to adapt the pre-trained models to the Ashokan Brahmi character dataset. The findings reveal that MobileNet outperforms the other two models in terms of accuracy, achieving a validation accuracy of 95.94% and a validation loss of 0.129. The paper provides an in-depth analysis of the implementation process using MobileNet and discusses the implications of the findings. The use of OCR for character recognition is of significant importance in the field of epigraphy, specifically for the preservation and digitization of ancient scripts. The results of this research paper demonstrate the effectiveness of using pre-trained CNNs for the recognition of Ashokan Brahmi characters.
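For readers unfamiliar with the transfer-learning recipe, a minimal Keras sketch along these lines is shown below; the input size, augmentation layers, optimizer, and the assumed number of character classes are illustrative choices, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 60  # assumed number of Ashokan Brahmi character classes

base = tf.keras.applications.MobileNet(include_top=False, weights="imagenet",
                                        input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # keep the ImageNet features frozen; train only the new head

model = models.Sequential([
    layers.Input((224, 224, 3)),
    layers.RandomRotation(0.05),   # light augmentation of the character images
    layers.RandomZoom(0.1),
    base,
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```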
https://arxiv.org/abs/2501.01981
This paper presents the Visual Optical Recognition Telemetry EXtraction (VORTEX) system for extracting and analyzing drone telemetry data from First Person View (FPV) Uncrewed Aerial System (UAS) footage. VORTEX employs MMOCR, a PyTorch-based Optical Character Recognition (OCR) toolbox, to extract telemetry variables from drone Heads Up Display (HUD) recordings, utilizing advanced image preprocessing techniques, including CLAHE enhancement and adaptive thresholding. The study optimizes spatial accuracy and computational efficiency through systematic investigation of temporal sampling rates (1s, 5s, 10s, 15s, 20s) and coordinate processing methods. Results demonstrate that the 5-second sampling rate, utilizing 4.07% of available frames, provides the optimal balance with a point retention rate of 64% and mean speed accuracy within 4.2% of the 1-second baseline while reducing computational overhead by 80.5%. Comparative analysis of coordinate processing methods reveals that while UTM Zone 33N projection and Haversine calculations provide consistently similar results (within 0.1% difference), raw WGS84 coordinates underestimate distances by 15-30% and speeds by 20-35%. Altitude measurements showed unexpected resilience to sampling rate variations, with only 2.1% variation across all intervals. This research is the first of its kind, providing quantitative benchmarks for establishing a robust framework for drone telemetry extraction and analysis using open-source tools and spatial libraries.
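The coordinate-processing comparison above turns on how distances are computed from WGS84 points. As a reference, a small sketch of the Haversine great-circle distance (one of the two mutually consistent methods in the study) is given below; the sample points and Earth-radius constant are illustrative.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, r=6_371_000.0):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: two nearby telemetry fixes. Treating raw degree differences as planar
# offsets ignores the metres-per-degree scaling that this formula accounts for,
# which is the kind of underestimation the study quantifies.
print(haversine_m(41.90, 12.49, 41.91, 12.50))
```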
https://arxiv.org/abs/2412.18505
This paper presents ERPA, an innovative Robotic Process Automation (RPA) model designed to enhance ID data extraction and optimize Optical Character Recognition (OCR) tasks within immigration workflows. Traditional RPA solutions often face performance limitations when processing large volumes of documents, leading to inefficiencies. ERPA addresses these challenges by incorporating Large Language Models (LLMs) to improve the accuracy and clarity of extracted text, effectively handling ambiguous characters and complex structures. Benchmark comparisons with leading platforms like UiPath and Automation Anywhere demonstrate that ERPA significantly reduces processing times by up to 94 percent, completing ID data extraction in just 9.94 seconds. These findings highlight ERPA's potential to revolutionize document automation, offering a faster and more reliable alternative to current RPA solutions.
https://arxiv.org/abs/2412.19840
Extracting medication names from handwritten doctor prescriptions is challenging due to the wide variability in handwriting styles and prescription formats. This paper presents a robust method for extracting medicine names using a combination of Mask R-CNN and Transformer-based Optical Character Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A novel dataset, featuring diverse handwritten prescriptions from various regions of Pakistan, was utilized to fine-tune the model on different handwriting styles. The Mask R-CNN model segments the prescription images to focus on the medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification. The proposed approach achieved a character error rate (CER) of 1.4% on standard benchmarks, highlighting its potential as a reliable and efficient tool for automating medicine name extraction.
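As a rough sketch of the transcription and database-matching stage, the snippet below runs a public TrOCR checkpoint (standing in for the paper's fine-tuned model) on a cropped region assumed to come from the Mask R-CNN step, then snaps the output to the closest database entry; the model choice, similarity cutoff, and helper names are assumptions.

```python
from difflib import get_close_matches
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def read_medicine_name(crop: Image.Image, known_medicines: list[str]) -> str | None:
    """Transcribe a cropped medicine-name region and match it against a database."""
    pixel_values = processor(images=crop, return_tensors="pt").pixel_values
    ids = model.generate(pixel_values)
    raw_text = processor.batch_decode(ids, skip_special_tokens=True)[0]
    # Snap the noisy transcription to the closest entry in the medicine database.
    matches = get_close_matches(raw_text.strip().lower(),
                                [m.lower() for m in known_medicines], n=1, cutoff=0.6)
    return matches[0] if matches else None
```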
https://arxiv.org/abs/2412.18199
This paper introduces LMRPA, a novel Large Model-Driven Robotic Process Automation (RPA) model designed to greatly improve the efficiency and speed of Optical Character Recognition (OCR) tasks. Traditional RPA platforms often suffer from performance bottlenecks when handling high-volume repetitive processes like OCR, leading to a less efficient and more time-consuming process. LMRPA allows the integration of Large Language Models (LLMs) to improve the accuracy and readability of extracted text, overcoming the challenges posed by ambiguous characters and complex text. Benchmarks were conducted comparing LMRPA to leading RPA platforms, including UiPath and Automation Anywhere, using OCR engines like Tesseract and DocTR. The results show that LMRPA achieves superior performance, cutting processing times by up to 52%. For instance, in Batch 2 of the Tesseract OCR task, LMRPA completed the process in 9.8 seconds, where UiPath finished in 18.1 seconds and Automation Anywhere finished in 18.7 seconds. Similar improvements were observed with DocTR, where LMRPA outperformed other automation tools conducting the same process by completing tasks in 12.7 seconds, while competitors took over 20 seconds to do the same. These findings highlight the potential of LMRPA to revolutionize OCR-driven automation processes, offering a more efficient and effective alternative solution to the existing state-of-the-art RPA models.
https://arxiv.org/abs/2412.18063
Automating high-volume unstructured data processing is essential for operational efficiency. Optical Character Recognition (OCR) is critical but often struggles with accuracy and efficiency in complex layouts and ambiguous text. These challenges are especially pronounced in large-scale tasks requiring both speed and precision. This paper introduces LMV-RPA, a Large Model Voting-based Robotic Process Automation system to enhance OCR workflows. LMV-RPA integrates outputs from OCR engines such as Paddle OCR, Tesseract OCR, Easy OCR, and DocTR with Large Language Models (LLMs) like LLaMA 3 and Gemini-1.5-pro. Using a majority voting mechanism, it processes OCR outputs into structured JSON formats, improving accuracy, particularly in complex layouts. The multi-phase pipeline processes text extracted by OCR engines through LLMs, combining results to ensure the most accurate outputs. LMV-RPA achieves 99 percent accuracy in OCR tasks, surpassing baseline models with 94 percent, while reducing processing time by 80 percent. Benchmark evaluations confirm its scalability and demonstrate that LMV-RPA offers a faster, more reliable, and efficient solution for automating large-scale document processing tasks.
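A minimal illustration of the field-level majority vote over several OCR engine outputs, reduced to its simplest form; the field names, normalization, and JSON layout below are placeholders, and the paper's full pipeline additionally routes the engine outputs through LLMs before combining them.

```python
import json
from collections import Counter

def majority_vote(field_readings: dict[str, list[str]]) -> str:
    """field_readings maps a field name to the readings produced by each OCR engine."""
    result = {}
    for field, readings in field_readings.items():
        normalized = [r.strip() for r in readings if r and r.strip()]
        if not normalized:
            continue
        value, count = Counter(normalized).most_common(1)[0]   # pick the most agreed-upon reading
        result[field] = {"value": value, "agreement": count / len(normalized)}
    return json.dumps(result, indent=2)

print(majority_vote({
    "invoice_no": ["A-123", "A-123", "A-l23", "A-123"],   # four engines, one misread
    "total": ["42.50", "42.50", "42.60", "42.50"],
}))
```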
https://arxiv.org/abs/2412.17965
This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik, with English serving as a benchmark. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges. Results emphasize the limitations of zero-shot LLM-based OCR, particularly for linguistically complex scripts, highlighting the need for annotated datasets and fine-tuned models. This work underscores the urgency of addressing accessibility gaps in text digitization, paving the way for inclusive and robust OCR solutions for underserved languages.
https://arxiv.org/abs/2412.16119
Recently, the joint design of optical systems and downstream algorithms has been showing significant potential. However, existing ray-based methods are limited to optimizing geometric degradation, making it difficult to fully represent the optical characteristics of complex, miniaturized lenses constrained by wavefront aberration or diffraction effects. In this work, we introduce a precise optical simulation model in which every operation in the pipeline is differentiable. This model employs a novel initial value strategy to enhance the reliability of intersection calculations on highly aspheric surfaces. Moreover, it utilizes a differential operator to reduce memory consumption during coherent point spread function calculations. To efficiently address various types of degradation, we design a joint optimization procedure that leverages field information. Guided by a general restoration network, the proposed method not only enhances image quality but also successively improves the optical performance of multiple lenses that are already at a professional level. This joint optimization pipeline offers innovative insights into the practical design of sophisticated optical systems and post-processing algorithms. The source code will be made publicly available at this https URL
https://arxiv.org/abs/2412.14603
Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose RoundTripOCR, an approach for synthetic data generation for Devanagari languages that tackles the scarcity of post-OCR error correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method translates erroneous OCR output into a corrected form by treating OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs and thereby effectively correcting OCR errors.
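To make the translation framing concrete, the sketch below treats a noisy OCR string as the source "language" and the clean string as the target, using a generic multilingual seq2seq checkpoint; the model choice, the Hindi example pair, and the single-step training/inference calls are illustrative assumptions rather than the paper's setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

noisy = "भरत एक महन देश है"      # OCR output with dropped vowel signs (illustrative)
clean = "भारत एक महान देश है"    # ground-truth correction

inputs = tokenizer(noisy, return_tensors="pt")
labels = tokenizer(clean, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss   # loss for one erroneous -> correct pair
loss.backward()                              # a single illustrative training step

# At inference time, correction is just generation:
corrected_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(corrected_ids[0], skip_special_tokens=True))
```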
https://arxiv.org/abs/2412.15248
License plate recognition (LPR) involves automated systems that utilize cameras and computer vision to read vehicle license plates. Plates collected through LPR can then be compared against databases to identify stolen vehicles, uninsured drivers, crime suspects, and more. The LPR system plays a significant role in saving time for institutions such as the police force. In the past, LPR relied heavily on Optical Character Recognition (OCR), which has been widely explored to recognize characters in images. Usually, collected plate images suffer from various limitations, including noise, blurring, weather conditions, and closely spaced characters, making recognition complex. Existing LPR methods still require significant improvement, especially for distorted images. To fill this gap, we propose utilizing visual language models (VLMs) such as OpenAI GPT-4o, Google Gemini 1.5, Google PaliGemma (Pathways Language and Image model + Gemma model), Meta Llama 3.2, Anthropic Claude 3.5 Sonnet, LLaVA, NVIDIA VILA, and moondream2 to recognize such unclear plates with closely spaced characters. This paper evaluates the VLMs' capability to address the aforementioned problems. Additionally, we introduce VehiclePaliGemma, a fine-tuned open-source PaliGemma VLM designed to recognize plates under challenging conditions. We compared our proposed VehiclePaliGemma with state-of-the-art methods and other VLMs using a dataset of Malaysian license plates collected under complex conditions. The results indicate that VehiclePaliGemma achieved superior performance with an accuracy of 87.6%. Moreover, it is able to predict a car's plate at a speed of 7 frames per second using an A100-80GB GPU. Finally, we explored the multitasking capability of the VehiclePaliGemma model to accurately identify plates in images containing multiple cars of various models and colors, with plates positioned and oriented in different directions.
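For context, a hedged sketch of prompting a PaliGemma checkpoint for plate reading is shown below; the public mix checkpoint stands in for the fine-tuned VehiclePaliGemma weights, and the image path and prompt wording are assumptions.

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"   # stand-in for the fine-tuned VehiclePaliGemma weights
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("plate.jpg")            # assumed path to a cropped or full vehicle image
prompt = "What is the license plate number of the car?"
inputs = processor(text=prompt, images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```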
https://arxiv.org/abs/2412.14197
We implemented a high-performance optical character recognition model for classical handwritten documents using data augmentation with highly variable cropping within the document region. Optical character recognition in handwritten documents, especially classical ones, has remained a challenging topic for many countries and research organizations. Although many researchers have studied this topic, the quality of classical texts over time and the unique stylistic characteristics of individual authors make it difficult. Recognizing hanja handwritten documents is a particularly meaningful and distinctive challenge, since hanja, which developed to reflect the vocabulary, semantic, and syntactic features of the Joseon Dynasty, differs from classical Chinese characters. To study this challenge, we used 1,100 small cursive documents and generated 100 augmented samples per document by cropping a randomly sized region within each document for training. We trained a two-stage object detection model based on the High-Resolution Network (HRNet), and the resulting model achieved a high inference recognition rate of 90% on cursive documents. Through this study, we also confirmed that OCR performance is affected by simplified characters, variants, variant characters, common characters, and alternate forms of Chinese characters that are rarely examined in other studies, and we propose that the results of this study can be applied to optical character recognition of modern documents in multiple languages as well as other typefaces in classical documents.
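The augmentation described above amounts to sampling many randomly sized crops inside each document region; a minimal sketch is given below, with the crop-size bounds and crop count as assumed parameters.

```python
import random
from PIL import Image

def random_region_crops(page: Image.Image, n_crops: int = 100,
                        min_frac: float = 0.3, max_frac: float = 0.9):
    """Yield n_crops randomly sized, randomly placed crops from one page image."""
    w, h = page.size
    for _ in range(n_crops):
        cw = int(w * random.uniform(min_frac, max_frac))   # random crop width
        ch = int(h * random.uniform(min_frac, max_frac))   # random crop height
        x0 = random.randint(0, w - cw)                     # random placement inside the page
        y0 = random.randint(0, h - ch)
        yield page.crop((x0, y0, x0 + cw, y0 + ch))
```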
https://arxiv.org/abs/2412.10647
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at this https URL.
https://arxiv.org/abs/2412.10302
Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g., optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without needing to split them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Due to the scarcity of open-source Chinese datasets for vision-language models, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all these filtering methods, and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.
https://arxiv.org/abs/2412.08443