Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage the model's optical character recognition and chart comprehension capabilities to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks of image-text-to-text tasks and compare against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. The dataset and model weights will be made publicly available.
https://arxiv.org/abs/2603.11804
An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. The text recognition problem is framed as a sequence generation problem with a VisionEncoderDecoder architecture, with a combination of encoder-decoders evaluated. The ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068. The proposed system also shows consistent performance when tested on an external dataset curated specifically for this study. The dataset offers completely different environmental and lighting conditions compared to the training sample, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
https://arxiv.org/abs/2603.10267
Large-scale scientific collaborations, such as the Compact Muon Solenoid (CMS) at CERN, produce a vast and ever-growing corpus of internal documentation. Navigating this complex information landscape presents a significant challenge for both new and experienced researchers, hindering knowledge sharing and slowing down the pace of scientific discovery. To address this, we present a prototype of MITRA, a Retrieval-Augmented Generation (RAG) based system, designed to answer specific, context-aware questions about physics analyses. MITRA employs a novel, automated pipeline using Selenium for document retrieval from internal databases and Optical Character Recognition (OCR) with layout parsing for high-fidelity text extraction. Crucially, MITRA's entire framework, from the embedding model to the Large Language Model (LLM), is hosted on-premise, ensuring that sensitive collaboration data remains private. We introduce a two-tiered vector database architecture that first identifies the relevant analysis from abstracts before focusing on the full documentation, resolving potential ambiguities between different analyses. We demonstrate the prototype's superior retrieval performance against a standard keyword-based baseline on realistic queries and discuss future work towards developing a comprehensive research agent for large experimental collaborations.
https://arxiv.org/abs/2603.09800
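MITRA's two-tiered retrieval (first match the query against analysis abstracts, then rank only the selected analysis's full-documentation chunks) can be sketched with plain cosine similarity. The embedding dimensions, data shapes, and function names below are illustrative assumptions, not MITRA's actual implementation:

```python
import numpy as np

def cosine_top1(query, matrix):
    """Index of the row in `matrix` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return int(np.argmax(m @ q))

def two_tier_retrieve(query_vec, abstract_vecs, chunk_vecs_by_analysis, top_k=2):
    # Tier 1: resolve which analysis the query is about via its abstract,
    # disambiguating between similar analyses before touching full text.
    analysis_id = cosine_top1(query_vec, abstract_vecs)
    # Tier 2: rank only that analysis's full-documentation chunks.
    chunks = chunk_vecs_by_analysis[analysis_id]
    q = query_vec / np.linalg.norm(query_vec)
    c = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    order = np.argsort(-(c @ q))[:top_k]
    return analysis_id, order.tolist()
```

The first tier keeps the second-tier search space small, which is what resolves ambiguities between analyses that share vocabulary.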
Document Image Machine Translation (DIMT) seeks to translate text embedded in document images from one language to another by jointly modeling both textual content and page layout, bridging optical character recognition (OCR) and natural language processing (NLP). The DIMT 2025 Challenge advances research on end-to-end document image translation, a rapidly evolving area within multimodal document understanding. The competition features two tracks, OCR-free and OCR-based, each with two subtasks for small (less than 1B parameters) and large (greater than 1B parameters) models. Participants submit a single unified DIMT system, with the option to incorporate provided OCR transcripts. Running from December 10, 2024 to April 20, 2025, the competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions, while Track 2 had 35 teams and 14 valid submissions. In this report, we present the challenge motivation, dataset construction, task definitions, evaluation protocol, and a summary of results. Our analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images and highlight substantial opportunities for future research.
https://arxiv.org/abs/2603.09392
Car license plate recognition is an image processing technology used to identify vehicles by capturing their license plates. The technology is also known as automatic number-plate recognition, automatic vehicle identification, or optical character recognition for cars. In Malaysia, where the number of vehicles is increasing rapidly, the large number of vehicles on the road has created considerable demand for car license plate recognition systems. Such systems can be implemented in electronic parking payment systems, highway toll-fee systems, and traffic surveillance systems, and as police enforcement tools. Additionally, car license plate recognition technology has the potential to be combined with techniques from other fields, such as biology and aerospace, to solve specialized problems.
https://arxiv.org/abs/2603.01016
Khmer is a low-resource language characterized by a complex script, presenting significant challenges for optical character recognition (OCR). While printed document text recognition has advanced because of available datasets, performance on other modalities, such as handwritten and scene text, remains limited by data scarcity. Training a separate model for each modality precludes cross-modality transfer learning, from which modalities with limited data could otherwise benefit. Moreover, deploying many modality-specific models results in significant memory overhead and requires error-prone routing of each input image to the appropriate model. On the other hand, simply training on a combined dataset with a non-uniform data distribution across modalities often leads to degraded performance on underrepresented modalities. To address these issues, we propose a universal Khmer text recognition (UKTR) framework capable of handling diverse text modalities. Central to our method is a novel modality-aware adaptive feature selection (MAFS) technique designed to adapt visual features according to the modality of a particular input image and enhance recognition robustness across modalities. Extensive experiments demonstrate that our model achieves state-of-the-art (SoTA) performance. Furthermore, we introduce the first comprehensive benchmark for universal Khmer text recognition, which we release to the community to facilitate future research. Our datasets and models can be accessed via this gated repository (in review).
https://arxiv.org/abs/2603.00702
The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge of processing complex visual documents, such as financial reports. While page-level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator's attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the "third building block" of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at this https URL.
https://arxiv.org/abs/2602.24134
Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
https://arxiv.org/abs/2602.22918
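The low-dimensionality claim above can be illustrated on synthetic data: if activation differences are dominated by a single direction plus isotropic noise, PC1 recovers that direction and captures most of the variance. This is a toy reproduction with synthetic vectors, not the paper's actual activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activation difference" vectors: a dominant 1-D OCR direction
# plus small isotropic noise, mimicking the paper's setting.
direction = rng.normal(size=64)
direction /= np.linalg.norm(direction)
coeffs = rng.normal(size=(500, 1))
diffs = coeffs * direction + 0.1 * rng.normal(size=(500, 64))

# PCA via SVD of the centered data matrix.
centered = diffs - diffs.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)
pc1 = vt[0]

print(f"PC1 explains {explained[0]:.1%} of variance")
print(f"|cos(PC1, true direction)| = {abs(pc1 @ direction):.3f}")
```

The cosine between PC1 and the planted direction is what "transfer across datasets" amounts to: the same direction being recovered from different data.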
Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge. A sparsity regularizer prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost. Evaluations on TibetanMNIST, Shui, ancient Yi, and Dongba show that OmniOCR outperforms zero-shot foundation models and standard post-training, achieving state-of-the-art accuracy with superior parameter efficiency; compared with the state-of-the-art baseline models, it improves accuracy by 39%-66% on these four datasets. Code: this https URL.
https://arxiv.org/abs/2602.21042
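The abstract does not spell out Dynamic LoRA's internals, but the general shape of a gated low-rank adapter with sparsity-driven pruning and zero inference overhead can be sketched as follows; all dimensions, names, and the gate vector are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

d_out, d_in, rank = 8, 16, 4
W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.1  # LoRA down-projection
B = rng.normal(size=(d_out, rank)) * 0.1 # LoRA up-projection
g = np.array([1.0, 0.6, 0.0, 0.0])       # per-rank gates; training drives some to 0

def forward(x):
    # Effective weight: base plus gated low-rank update B diag(g) A.
    return (W + B @ np.diag(g) @ A) @ x

def l1_penalty(lmbda=1e-3):
    # Sparsity regularizer on the gates, added to the task loss,
    # which prunes redundant rank directions.
    return lmbda * np.abs(g).sum()

# Ranks with zero gates can be dropped entirely and the surviving update
# merged into W, so inference pays no extra cost.
keep = g != 0
W_merged = W + B[:, keep] @ np.diag(g[keep]) @ A[keep]
x = rng.normal(size=d_in)
```

Allocating capacity "across layers and scripts" would then amount to letting each layer/script pair learn its own gate vector, with the penalty deciding how many ranks each one keeps.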
Comparative evaluation of several systems is a recurrent task in research. It is a key step before deciding which system to use for our work, or, once our research has been conducted, for demonstrating the potential of the resulting model. Furthermore, it is the central task in evaluating competitive public challenges. Our proposed software (DEEP) automates both the execution and scoring of machine translation and optical character recognition models, and is easily extensible to other tasks. DEEP is prepared to receive dockerized systems, run them (extracting information at the same time), and assess hypotheses against reference outputs. With this approach, evaluators can achieve a better understanding of the performance of each model. Moreover, the software uses a clustering algorithm based on a statistical analysis of the significance of the results yielded by each model, according to the evaluation metrics. As a result, evaluators are able to identify clusters of performance among the set of proposals and better understand the significance of their differences. Additionally, we offer a visualization web app to ensure that the results can be adequately understood and interpreted. Finally, we present an exemplary use case of DEEP.
https://arxiv.org/abs/2602.19583
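DEEP's exact clustering algorithm is not specified in the abstract; one minimal sketch of the idea (group systems whose per-sample score differences are not statistically significant, here via an approximate randomization test and transitive grouping, both of which are assumptions) is:

```python
import numpy as np

rng = np.random.default_rng(42)

def paired_bootstrap_p(scores_a, scores_b, n_flips=2000):
    """Two-sided p-value for the mean per-sample score difference,
    via sign-flipping (approximate randomization)."""
    d = scores_a - scores_b
    obs = abs(d.mean())
    count = 0
    for _ in range(n_flips):
        signs = rng.choice([-1, 1], size=d.size)
        if abs((signs * d).mean()) >= obs:
            count += 1
    return count / n_flips

def cluster_by_significance(score_table, alpha=0.05):
    """Union systems whose differences are not significant (transitive closure)."""
    names = list(score_table)
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if paired_bootstrap_p(score_table[a], score_table[b]) > alpha:
                parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())
```

Systems in the same group are statistically indistinguishable under the chosen metric, which is the "clusters of performance" reading of the abstract.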
Additive manufacturing is enabling soft robots with increasingly complex geometries, creating a demand for sensing solutions that remain compatible with single-material, one-step fabrication. Optical soft sensors are attractive for monolithic printing, but their performance is often degraded by uncontrolled light propagation (ambient coupling, leakage, scattering), while common mitigation strategies typically require multimaterial interfaces. Here, we present an approach for 3D printed soft optical sensing (SOLen), in which a printed lens is placed in front of an emitter within a Y-shaped waveguide. The sensing mechanism relies on deformation-induced lens rotation and focal-spot translation, redistributing optical power between the two branches to generate a differential output that encodes both motion direction and amplitude. An acrylate polyurethane resin was modified with lauryl acrylate to improve compliance and optical transmittance, and single-layer optical characterization was used to derive wavelength-dependent refractive index and transmittance while minimizing DLP layer-related artifacts. The measured refractive index was used in simulations to design a lens profile for a target focal distance, which was then printed with sub-millimeter fidelity. Rotational tests demonstrated reproducible branch-selective signal switching over multiple cycles. These results establish a transferable material-to-optics workflow for lens-based soft optical sensors with new functionalities for next-generation soft robots.
https://arxiv.org/abs/2602.17421
Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
https://arxiv.org/abs/2602.16872
Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, even though it was not trained on the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves a 3-6x speedup over its predecessor while being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving an 89.8% Exact Match score with faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
https://arxiv.org/abs/2602.16430
Optical Character Recognition (OCR) of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. Although transformer-based OCR systems and Vision-Language Models (VLMs) achieve strong aggregate accuracy, metrics such as Character Error Rate (CER) and Word Error Rate (WER) provide limited insight into their reliability for scholarly use. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts using length-weighted accuracy metrics and hypothesis driven error analysis. While Qwen achieves lower CER/WER and greater robustness to degraded input, it exhibits selective linguistic regularization and orthographic normalization that may silently alter historically meaningful forms. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation. Our findings show that architectural inductive biases shape OCR error structure in systematic ways. Models with similar aggregate accuracy can differ substantially in error locality, detectability, and downstream scholarly risk, underscoring the need for architecture-aware evaluation in historical digitization workflows.
https://arxiv.org/abs/2602.14524
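The CER and WER metrics compared above both reduce to Levenshtein edit distance, normalized by reference length at the character or word level; a minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming
    (works on strings for CER, token lists for WER)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution (0 if match)
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

The length-weighted (corpus-level) variants the paper uses divide total edits by total reference length rather than averaging per-line rates, so longer lines weigh more; note also that a raw rate cannot distinguish a faithful transcription of archaic spelling from a silently modernized one, which is the paper's point.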
Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, a 14.1% improvement over EasyOCR and a 9.9% improvement over PaddleOCR baselines. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. Through extensive experimentation on real-world toll plaza imagery, we demonstrate that unified vision-language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
https://arxiv.org/abs/2602.07051
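Expected Calibration Error, as reported above, bins predictions by confidence and takes the weighted average gap between accuracy and mean confidence per bin; a minimal sketch (the bin count and equal-width binning are conventional choices, not taken from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted gap between accuracy and mean confidence per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 falls in no bin.
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```

An ECE of 0.048 means that, averaged over bins, the model's stated confidence is within about five percentage points of its actual accuracy.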
Optical character recognition (OCR), which converts printed or handwritten text into machine-readable form, is widely used in assistive technology for people with blindness and low vision. Yet, most evaluations rely on static datasets that do not reflect the challenges of mobile use. In this study, we systematically evaluated OCR performance under both static and dynamic conditions. Static tests measured detection range across distances of 1-7 meters and viewing angles of 0-75 degrees horizontally. Dynamic tests examined the impact of motion by varying walking speed from slow (0.8 m/s) to very fast (1.8 m/s) and comparing three camera mounting positions: head-mounted, shoulder-mounted, and hand-held. We evaluated both a smartphone and smart glasses, using the phone's main and ultra-wide cameras. Four OCR engines were benchmarked to assess accuracy at different distances and viewing angles: Google Vision, PaddleOCR 3.0, EasyOCR, and Tesseract. PaddleOCR 3.0 was then used to evaluate accuracy at different walking speeds. Accuracy was computed at the character level using the Levenshtein ratio against manually defined ground truth. Results showed that recognition accuracy declined with increased walking speed and wider viewing angles. Google Vision achieved the highest overall accuracy, with PaddleOCR close behind as the strongest open-source alternative. Across devices, the phone's main camera achieved the highest accuracy, and a shoulder-mounted placement yielded the highest average among body positions; however, differences among shoulder, head, and hand were not statistically significant.
https://arxiv.org/abs/2602.02223
Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20x fewer effective tokens, and surpasses models with up to 985x more activated parameters while tuning only 256k trainable parameters.
https://arxiv.org/abs/2602.02014
Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model's Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple semantically related but more benign sub-queries. Second, we pick a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the sub-queries positioned in the middle of the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue this attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distractions, where the model's safety protocols fail to link the scattered sub-queries within a high number of irrelevant queries. Overall, our findings expose a critical vulnerability: LVLMs' OCR capabilities are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses against fragmented multimodal inputs.
https://arxiv.org/abs/2602.00420
Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
https://arxiv.org/abs/2601.22114
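The Connected-Component Labeling step SINA uses for connectivity extraction amounts to a flood fill over a binarized schematic image: every wire pixel in one 4-connected component lies on the same electrical net. This grid-based toy (not SINA's actual implementation) shows the idea:

```python
from collections import deque

def connected_components(grid):
    """4-connected component labeling on a binary grid (1 = wire pixel).
    Returns a label matrix and the component count; pixels sharing a
    label belong to the same net."""
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if grid[sy][sx] and not labels[sy][sx]:
                next_label += 1                     # new component found
                labels[sy][sx] = next_label
                queue = deque([(sy, sx)])
                while queue:                        # BFS flood fill
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels, next_label
```

Netlist construction would then map each labeled component to a net and attach the components whose detected pins touch it.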
Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet, large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline integrating direct text parsing, optical character recognition, and automated reconstruction. This ensures accuracy and clarity. Each record aligns a target word with its standardized Arabic definition and metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.
https://arxiv.org/abs/2601.21512