Learning multimodal video understanding typically relies on datasets comprising video clips paired with manually annotated captions. However, annotation becomes even more challenging for long-form videos, lasting from minutes to hours, in the educational and news domains, because it requires more annotators with subject expertise. Hence, there arises a need for automated solutions. Recent advancements in Large Language Models (LLMs) promise to capture concise and informative content that allows the comprehension of entire videos by leveraging Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) technologies. ASR provides textual content from audio, while OCR extracts textual content from specific frames. This paper introduces a dataset comprising long-form lecture and news videos. We present baseline approaches to understand their limitations on this dataset and advocate exploring prompt-engineering techniques to comprehensively understand long-form multimodal video datasets.
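As a rough illustration of the kind of pipeline described above, the sketch below fuses ASR text and OCR text from sampled frames into a single LLM summarization call. It assumes openai-whisper, pytesseract, and an OpenAI-compatible chat endpoint; the paper's actual baselines and prompts may differ.

```python
# Sketch only: ASR text + OCR text from sampled frames -> one LLM summarization call.
import whisper          # openai-whisper (assumed ASR engine)
import pytesseract      # assumed OCR engine for sampled frames
from PIL import Image
from openai import OpenAI

def summarize_lecture(video_path: str, frame_paths: list[str]) -> str:
    asr_text = whisper.load_model("base").transcribe(video_path)["text"]
    ocr_text = "\n".join(pytesseract.image_to_string(Image.open(p)) for p in frame_paths)
    prompt = (
        "Summarize this long lecture video from its transcript and on-screen text.\n"
        f"--- ASR transcript ---\n{asr_text}\n"
        f"--- OCR text from sampled frames ---\n{ocr_text}"
    )
    resp = OpenAI().chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```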
https://arxiv.org/abs/2503.08335
Navigating urban environments poses significant challenges for people with disabilities, particularly those with blindness and low vision. Environments with dynamic and unpredictable elements like construction sites are especially challenging. Construction sites introduce hazards like uneven surfaces, obstructive barriers, hazardous materials, and excessive noise, and they can alter routing, complicating safe mobility. Existing assistive technologies are limited, as navigation apps do not account for construction sites during trip planning, and detection tools that attempt hazard recognition struggle to address the extreme variability of construction paraphernalia. This study introduces a novel computer vision-based system that integrates open-vocabulary object detection, a YOLO-based scaffolding-pole detection model, and an optical character recognition (OCR) module to comprehensively identify and interpret construction site elements for assistive navigation. In static testing across seven construction sites, the system achieved an overall accuracy of 88.56%, reliably detecting objects from 2m to 10m within a 0° to 75° angular offset. At closer distances (2-4m), the detection rate was 100% at all tested angles.
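A minimal sketch of how such a detector-plus-OCR fusion could look, assuming a hypothetical fine-tuned YOLO weights file ("scaffold_pole.pt") and easyocr standing in for the OCR module; the study's open-vocabulary detector is omitted here.

```python
# Sketch: YOLO-based scaffolding-pole detection + OCR of signage, fused into spoken-style messages.
from ultralytics import YOLO
import easyocr

pole_model = YOLO("scaffold_pole.pt")   # hypothetical fine-tuned scaffolding-pole weights
reader = easyocr.Reader(["en"])         # OCR for construction signage

def describe_scene(image_path: str) -> list[str]:
    messages = []
    for box in pole_model(image_path)[0].boxes:
        messages.append(f"scaffolding pole at {box.xyxy[0].tolist()}")
    for _bbox, text, conf in reader.readtext(image_path):
        if conf > 0.5:
            messages.append(f'sign reads: "{text}"')
    return messages   # to be passed to the navigation front end (e.g., text-to-speech)
```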
https://arxiv.org/abs/2503.04139
Automated defect detection in industrial manufacturing is essential for maintaining product quality and minimizing production errors. In air disc brake manufacturing, ensuring the precision of laser-engraved nameplates is crucial for accurate product identification and quality control. Engraving errors, such as misprints or missing characters, can compromise both aesthetics and functionality, leading to material waste and production delays. This paper presents a proof of concept for an AI-driven computer vision system that inspects and verifies laser-engraved nameplates, detecting defects in logos and alphanumeric strings. The system integrates object detection using YOLOv7, optical character recognition (OCR) with Tesseract, and anomaly detection through a residual variational autoencoder (ResVAE) along with other computer vision methods to enable comprehensive inspections at multiple stages. Experimental results demonstrate the system's effectiveness, achieving 91.33% accuracy and 100% recall, ensuring that defective nameplates are consistently detected and addressed. This solution highlights the potential of AI-driven visual inspection to enhance quality control, reduce manual inspection efforts, and improve overall manufacturing efficiency.
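The OCR verification stage might look roughly like the sketch below, with pytesseract standing in for the Tesseract integration; the YOLOv7 cropping and the ResVAE anomaly detector are not shown.

```python
# Sketch: verify that the engraved alphanumeric string on a cropped nameplate matches the work order.
import pytesseract
from PIL import Image

def verify_nameplate(crop_path: str, expected: str) -> bool:
    text = pytesseract.image_to_string(
        Image.open(crop_path),
        config="--psm 7",   # treat the crop as a single text line
    )
    read = "".join(ch for ch in text if ch.isalnum()).upper()
    return read == expected.upper()

# Example: flag the plate if verify_nameplate("plate_042.png", "AB123456") returns False.
```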
https://arxiv.org/abs/2503.03395
Handwritten text recognition (HTR) remains a challenging task, particularly for multi-page documents where pages share common formatting and contextual features. While modern optical character recognition (OCR) engines are proficient with printed text, their performance on handwriting is limited, often requiring costly labeled data for fine-tuning. In this paper, we explore the use of multi-modal large language models (MLLMs) for transcribing multi-page handwritten documents in a zero-shot setting. We investigate various configurations of commercial OCR engines and MLLMs, utilizing the latter both as end-to-end transcribers and as post-processors, with and without image components. We propose a novel method, '+first page', which enhances MLLM transcription by providing the OCR output of the entire document along with just the first page image. This approach leverages shared document features without incurring the high cost of processing all images. Experiments on a multi-page version of the IAM Handwriting Database demonstrate that '+first page' improves transcription accuracy, balances cost with performance, and even enhances results on out-of-sample text by extrapolating formatting and OCR error patterns from a single page.
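A sketch of how a '+first page' request could be assembled for an OpenAI-style multimodal chat API; the model name and prompt wording are illustrative, and the paper evaluates several commercial OCR engines and MLLMs rather than this exact call.

```python
# Sketch: OCR output of the whole document + only the first page image, in one MLLM request.
import base64
from openai import OpenAI

def transcribe_plus_first_page(ocr_pages: list[str], first_page_png: str) -> str:
    with open(first_page_png, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",   # placeholder MLLM
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Correct this OCR transcript of a multi-page handwritten document. "
                         "Only the first page image is attached; infer formatting and OCR error "
                         "patterns from it and apply them to all pages.\n\n"
                         + "\n\n".join(ocr_pages)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```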
https://arxiv.org/abs/2502.20295
Visual navigation devices require precise calibration to achieve high-precision localization and navigation, which includes camera and attitude calibration. To address the limitations of time-consuming camera calibration and complex attitude adjustment processes, this study presents a collimator-based calibration method and system. Based on the optical characteristics of the collimator, a single-image camera calibration algorithm is introduced. In addition, integrated with the precision adjustment mechanism of the calibration frame, a rotation transfer model between coordinate systems enables efficient attitude calibration. Experimental results demonstrate that the proposed method achieves accuracy and stability comparable to traditional multi-image calibration techniques. Specifically, the re-projection errors are less than 0.1463 pixels, and average attitude angle errors are less than 0.0586 degrees with a standard deviation less than 0.0257 degrees, demonstrating high precision and robustness.
https://arxiv.org/abs/2502.18012
Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors -- imperfect extraction of the text, including character insertion, deletion and permutation -- can significantly impact downstream tasks like question-answering (QA). In this work, we introduce a multilingual QA dataset MultiOCR-QA, designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages: English, French, and German. The dataset is curated from OCR-ed old documents, allowing for the evaluation of OCR-induced challenges on question answering. We evaluate MultiOCR-QA on various levels and types of OCR errors to assess the robustness of LLMs in handling real-world digitization errors. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
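For intuition, the toy function below injects the three error types the benchmark studies (character insertion, deletion, and permutation); the actual MultiOCR-QA noise comes from real OCR-ed historical documents, not from synthetic perturbation.

```python
# Toy OCR-noise injector illustrating insertion, deletion and adjacent-character swaps.
import random
import string

def inject_ocr_noise(text: str, rate: float = 0.05, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(["insert", "delete", "swap"])
            if op == "insert":
                chars.insert(i, rng.choice(string.ascii_lowercase))
            elif op == "delete":
                chars.pop(i)
                continue   # re-check the character that shifted into position i
            elif op == "swap" and i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        i += 1
    return "".join(chars)
```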
https://arxiv.org/abs/2502.16781
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
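Character Error Rate, the headline metric above, is the edit distance between prediction and reference divided by the reference length; a minimal implementation is sketched below (KITAB-Bench's exact text normalization for Arabic may differ).

```python
# Minimal CER: Levenshtein distance over characters, normalized by reference length.
def cer(reference: str, prediction: str) -> float:
    n, m = len(reference), len(prediction)
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == prediction[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[m] / max(n, 1)

# cer("كتاب", "كتب") == 0.25   # one deleted character over a four-character reference
```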
https://arxiv.org/abs/2502.14949
Oscar Wilde said, "The difference between literature and journalism is that journalism is unreadable, and literature is not read." Unfortunately, the digitally archived journalism of Oscar Wilde's 19th century often has missing or poor-quality Optical Character Recognition (OCR), reducing the accessibility of these archives and making them unreadable both figuratively and literally. This paper helps address the issue by performing OCR on "The Nineteenth Century Serials Edition" (NCSE), an 84k-page collection of 19th-century English newspapers and periodicals, using Pixtral 12B, a pre-trained image-to-text language model. The OCR capability of Pixtral was compared to four other OCR approaches, achieving a median character error rate of 1%, 5x lower than the next best model. The resulting NCSE v2.0 dataset features improved article identification, high-quality OCR, and text classified into four types and seventeen topics. The dataset contains 1.4 million entries and 321 million words. Example use cases demonstrate analysis of topic similarity, readability, and event tracking. NCSE v2.0 is freely available to encourage historical and sociological research. As a result, 21st-century readers can now share Oscar Wilde's disappointment with 19th-century journalistic standards, reading the unreadable from the comfort of their own computers.
https://arxiv.org/abs/2502.14901
Visual Question Answering (VQA) is a challenging problem that requires processing multimodal input. Answer-Set Programming (ASP) has shown great potential in this regard to add interpretability and explainability to modular VQA architectures. In this work, we address the problem of how to integrate ASP with modules for vision and natural language processing to solve a new and demanding VQA variant that is concerned with images of graphs (not graphs in symbolic form). Images containing graph-based structures are a ubiquitous and popular form of visualisation. Here, we deal with the particular problem of graphs inspired by transit networks, and we introduce a novel dataset that amends an existing one by adding images of graphs that resemble metro lines. Our modular neuro-symbolic approach combines optical graph recognition for graph parsing, a pretrained optical character recognition neural network for parsing labels, Large Language Models (LLMs) for language processing, and ASP for reasoning. This method serves as a first baseline and achieves an overall average accuracy of 73% on the dataset. Our evaluation provides further evidence of the potential of modular neuro-symbolic systems, in particular those that combine pretrained models requiring no further training with logic programming for reasoning, to solve complex VQA tasks.
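To illustrate the ASP reasoning stage, the toy program below encodes a parsed metro-style graph as facts and answers a reachability query with the clingo Python API; the paper's actual encoding and the facts produced by its vision modules are richer.

```python
# Toy ASP reachability over a parsed metro-style graph, solved with clingo.
import clingo

program = """
edge(a, b). edge(b, c). edge(c, d).   % stations/links parsed from the graph image
edge(X, Y) :- edge(Y, X).             % metro lines are undirected
reach(X, Y) :- edge(X, Y).
reach(X, Z) :- reach(X, Y), edge(Y, Z).
#show reach/2.
"""

ctl = clingo.Control()
ctl.add("base", [], program)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda model: print(model))   # prints all reachable station pairs
```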
https://arxiv.org/abs/2502.09211
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
https://arxiv.org/abs/2502.06445
Converting images of Arabic text into plain text is a widely researched topic in academia and industry. However, recognition of Arabic handwritten and printed text presents difficult challenges due to the complex nature and variations of the Arabic script. This work proposes an end-to-end solution for recognizing Arabic handwritten text, printed text, and Arabic numbers, and presents the data in a structured manner. We reached 81.66% precision, 78.82% recall, and 79.07% F-measure on the text detection task that powers the proposed solution. The proposed recognition model incorporates state-of-the-art CNN-based feature extraction and Transformer-based sequence modeling to accommodate variations in handwriting styles, stroke thicknesses, alignments, and noise conditions. The evaluation suggests strong performance on both printed and handwritten text, yielding 0.59% CER and 1.72% WER on printed text, and 7.91% CER and 31.41% WER on handwritten text. The overall solution has proven reliable in real-life OCR tasks: it couples the detection and recognition models with supporting feature-extraction and matching algorithms, and its general-purpose implementation makes it applicable to any Arabic handwritten or printed document or receipt, and thus practical in a wide range of contexts.
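A compact PyTorch skeleton of a CNN-encoder plus Transformer-decoder recognizer of the kind described above; the layer sizes, vocabulary, and the omitted causal mask are illustrative simplifications, not the paper's configuration.

```python
# Skeleton: CNN feature extraction -> column sequence -> Transformer decoder -> character logits.
import torch
import torch.nn as nn

class CNNTransformerOCR(nn.Module):
    def __init__(self, vocab_size: int = 120, d_model: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, d_model, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),          # collapse height, keep width
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=4
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, tgt_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(images)                      # (B, d_model, 1, W')
        memory = feats.squeeze(2).permute(0, 2, 1)    # (B, W', d_model) column features
        tgt = self.embed(tgt_tokens)                  # (B, T, d_model)
        return self.head(self.decoder(tgt, memory))   # (B, T, vocab_size)

# logits = CNNTransformerOCR()(torch.randn(2, 1, 32, 128), torch.zeros(2, 20, dtype=torch.long))
```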
https://arxiv.org/abs/2502.05277
Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure -- including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages -- as well as semantic information for detecting elements like footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce our diverse human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.
https://arxiv.org/abs/2502.04223
Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
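Segment-wise post-correction with an open-weight instruction model could be sketched as below; the checkpoint, prompt, and decoding settings are placeholders, and the study additionally varies quantization and segment length.

```python
# Sketch: correct one OCR segment at a time with an open-weight instruct model via transformers.
from transformers import pipeline

corrector = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")  # placeholder checkpoint

def correct_segment(ocr_segment: str) -> str:
    messages = [{
        "role": "user",
        "content": "Correct the OCR errors in the following historical text. "
                   "Return only the corrected text.\n\n" + ocr_segment,
    }]
    out = corrector(messages, max_new_tokens=4 * len(ocr_segment.split()))
    return out[0]["generated_text"][-1]["content"]   # last turn is the model's reply
```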
https://arxiv.org/abs/2502.01205
The widespread adoption of machine learning (ML) has brought forth diverse models with varying architectures and data requirements, introducing new challenges in integrating these systems into real-world applications. Traditional solutions often struggle to manage the complexities of connecting heterogeneous models, especially when dealing with varied technical specifications. These limitations are amplified in large-scale, collaborative projects where stakeholders contribute models with different technical specifications. To address these challenges, we developed LoCoML, a low-code framework designed to simplify the integration of diverse ML models within the context of the Bhashini Project, a large-scale initiative aimed at integrating AI-driven language technologies such as automatic speech recognition, machine translation, text-to-speech, and optical character recognition to support seamless communication across more than 20 languages. Initial evaluations show that LoCoML adds only a small amount of computational load, making it efficient and effective for large-scale ML integration. Our practical insights show that a low-code approach can be a practical solution for connecting multiple ML models in a collaborative environment.
https://arxiv.org/abs/2501.14165
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
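TrOCR, one of the three evaluated approaches, can be run (and later fine-tuned) through Hugging Face transformers; a minimal inference sketch follows, with the handwritten base checkpoint as an illustrative starting point rather than one of the models trained in the paper.

```python
# Minimal TrOCR line transcription; fine-tuning for Sámi would continue from such a checkpoint.
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def transcribe_line(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```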
https://arxiv.org/abs/2501.07300
Video-based Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate text information from video captures. Traditional systems typically rely heavily on high-end computing resources and utilize multiple frames to recognize license plates, leading to increased computational overhead. In this paper, we propose two methods capable of efficiently extracting exactly one frame per vehicle and recognizing its license plate characters from this single image, thus significantly reducing computational demands. The first method uses Visual Rhythm (VR) to generate time-spatial images from videos, while the second employs Accumulative Line Analysis (ALA), a novel algorithm based on single-line video processing for real-time operation. Both methods leverage YOLO for license plate detection within the frame and a Convolutional Neural Network (CNN) for Optical Character Recognition (OCR) to extract textual information. Experiments on real videos demonstrate that the proposed methods achieve results comparable to traditional frame-by-frame approaches, with processing speeds three times faster.
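The Visual Rhythm step can be sketched as stacking one fixed line of pixels from every frame into a time-space image, in which passing vehicles leave distinctive streaks; the row index is a tunable "virtual line", and the YOLO and CNN-OCR stages are not shown.

```python
# Sketch of Visual Rhythm: one pixel row per frame, stacked over time.
import cv2
import numpy as np

def visual_rhythm(video_path: str, line_row: int = 400) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    rows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rows.append(frame[line_row])   # one (width, 3) slice per frame
    cap.release()
    return np.stack(rows)              # (num_frames, width, 3) time-space image

# vr = visual_rhythm("gate_camera.mp4"); cv2.imwrite("rhythm.png", vr)
```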
https://arxiv.org/abs/2501.04750
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: this https URL
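The grid-and-remap step might look roughly like the sketch below: cubic polynomials fitted to points on the curved top and bottom document edges define a dense sampling grid for cv2.remap. The deep-learning outline detection and the paper's exact grid construction are not shown.

```python
# Sketch: cubic fits to the curved top/bottom edges -> dense grid -> cv2.remap to flatten the page.
import cv2
import numpy as np

def dewarp(image: np.ndarray, top_pts: np.ndarray, bottom_pts: np.ndarray) -> np.ndarray:
    """top_pts / bottom_pts: (N, 2) points sampled along the curved top and bottom edges."""
    h, w = image.shape[:2]
    xs = np.arange(w, dtype=np.float32)
    top = np.polyval(np.polyfit(top_pts[:, 0], top_pts[:, 1], 3), xs)        # cubic edge fits
    bottom = np.polyval(np.polyfit(bottom_pts[:, 0], bottom_pts[:, 1], 3), xs)

    map_x = np.tile(xs, (h, 1))                              # x coordinates unchanged
    t = np.linspace(0.0, 1.0, h)[:, None]                    # vertical fraction per output row
    map_y = (1.0 - t) * top[None, :] + t * bottom[None, :]   # sample between the two curves
    return cv2.remap(image, map_x.astype(np.float32), map_y.astype(np.float32), cv2.INTER_LINEAR)
```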
https://arxiv.org/abs/2501.03145
Generating visual text in natural scene images is a challenging task with many unsolved problems. Different from generating text on artificially designed images (such as posters, covers, cartoons, etc.), the text in natural scene images needs to meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, walls, etc.), and the generated text content should also be relevant to the scene. (3) Utility: the generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: the attributes of the text (such as font and color) should be controllable. In this paper, we propose a two-stage method, SceneVTG++, which simultaneously satisfies the four aspects mentioned above. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former utilizes the world knowledge of multimodal large language models to find reasonable text areas and recommend text content according to the natural scene background images, while the latter generates controllable multilingual text based on a diffusion model. Through extensive experiments, we respectively verify the effectiveness of TLCG and CLTD, and demonstrate the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks like text detection and text recognition. Codes and datasets will be available.
https://arxiv.org/abs/2501.02962
Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate information from an image or a video capture. These systems have gained popularity due to the wide availability of low-cost surveillance cameras and advances in Deep Learning. Typically, video-based ALPR systems rely on multiple frames to detect the vehicle and recognize the license plates. Therefore, we propose a system capable of extracting exactly one frame per vehicle and recognizing its license plate characters from this singular image using an Optical Character Recognition (OCR) model. Early experiments show that this methodology is viable.
https://arxiv.org/abs/2501.02270
Super-resolution (SR) techniques play a pivotal role in enhancing the quality of low-resolution images, particularly for applications such as security and surveillance, where accurate license plate recognition is crucial. This study proposes a novel framework that combines pixel-based loss with embedding similarity learning to address the unique challenges of license plate super-resolution (LPSR). The introduced pixel and embedding consistency loss (PECL) integrates a Siamese network and applies contrastive loss to force embedding similarities to improve perceptual and structural fidelity. By effectively balancing pixel-wise accuracy with embedding-level consistency, the framework achieves superior alignment of fine-grained features between high-resolution (HR) and super-resolved (SR) license plates. Extensive experiments on the CCPD dataset validate the efficacy of the proposed framework, demonstrating consistent improvements over state-of-the-art methods in terms of PSNR_RGB, PSNR_Y and optical character recognition (OCR) accuracy. These results highlight the potential of embedding similarity learning to advance both perceptual quality and task-specific performance in extreme super-resolution scenarios.
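In the spirit of the pixel-and-embedding consistency loss, a training objective could combine an L1 pixel term with a contrastive term over Siamese embeddings, roughly as sketched below; the encoder, margin, and weighting are illustrative and not the paper's exact PECL.

```python
# Sketch: pixel loss + contrastive embedding-consistency loss for license plate super-resolution.
import torch
import torch.nn.functional as F

def pecl_like_loss(sr, hr, encoder, same_plate: torch.Tensor,
                   margin: float = 1.0, alpha: float = 0.1) -> torch.Tensor:
    pixel = F.l1_loss(sr, hr)
    z_sr, z_hr = encoder(sr), encoder(hr)          # shared-weight (Siamese) embeddings
    dist = F.pairwise_distance(z_sr, z_hr)         # (B,) embedding distances
    contrastive = torch.where(
        same_plate.bool(),
        dist.pow(2),                               # matching pair: pull embeddings together
        F.relu(margin - dist).pow(2),              # mismatched pair: push beyond the margin
    ).mean()
    return pixel + alpha * contrastive
```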
https://arxiv.org/abs/2501.01483