Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
https://arxiv.org/abs/2601.16113
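The normalization and word-segmentation stages of the pipeline described above can be sketched as follows. This is a minimal illustration, not SynthOCR-Gen's actual code: the Unicode ranges used for the script-purity check and the sample string are assumptions for demonstration.

```python
import re
import unicodedata

# Unicode ranges of the Arabic and Arabic Supplement blocks, which cover
# the Perso-Arabic script used by Kashmiri. This is a simplification for
# the sketch; a production tool may also admit presentation forms.
ARABIC_BLOCK = re.compile(r"^[\u0600-\u06FF\u0750-\u077F]+$")

def normalize(text: str) -> str:
    """Apply canonical Unicode normalization (NFC)."""
    return unicodedata.normalize("NFC", text)

def enforce_script_purity(words, pattern=ARABIC_BLOCK):
    """Keep only tokens written purely in the target script."""
    return [w for w in words if pattern.match(w)]

def segment_words(corpus: str):
    """Word-level segmentation: normalize, split on whitespace, filter."""
    return enforce_script_purity(normalize(corpus).split())

# Mixed-script input: only the Perso-Arabic token should survive.
print(segment_words("کٲشُر zabān 1947"))
```

The surviving tokens would then be passed to the font-rendering and augmentation stages to produce labeled image samples.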
The ubiquity of Large Language Models (LLMs) is driving a paradigm shift where user convenience supersedes computational efficiency. This article defines the "Plausibility Trap": a phenomenon where individuals with access to Artificial Intelligence (AI) models deploy expensive probabilistic engines for simple deterministic tasks, such as Optical Character Recognition (OCR) or basic verification, resulting in significant resource waste. Through micro-benchmarks and case studies on OCR and fact-checking, we quantify the "efficiency tax" (demonstrating a ~6.5x latency penalty) and the risks of algorithmic sycophancy. To counter this, we introduce Tool Selection Engineering and the Deterministic-Probabilistic Decision Matrix, a framework to help developers determine when to use Generative AI and, crucially, when to avoid it. We argue for a curriculum shift, emphasizing that true digital literacy lies not only in knowing how to use Generative AI, but also in knowing when not to use it.
https://arxiv.org/abs/2601.15130
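A decision matrix of the kind described above might be sketched as a simple routing function. The task attributes and thresholds below are illustrative assumptions, not the article's actual criteria; only the ~6.5x latency figure comes from the abstract.

```python
def choose_tool(task: dict) -> str:
    """Route a task to a deterministic tool or a generative model.

    task: {"deterministic": bool, "ambiguity": float in [0, 1]}
    The 0.3 ambiguity threshold is an illustrative assumption.
    """
    if task["deterministic"] and task["ambiguity"] < 0.3:
        # e.g. OCR of clean print, checksum validation, date parsing:
        # a classical tool is cheaper (the article measures a ~6.5x
        # latency penalty for an LLM here) and cannot be sycophantic.
        return "deterministic"
    return "generative"

print(choose_tool({"deterministic": True, "ambiguity": 0.1}))
print(choose_tool({"deterministic": False, "ambiguity": 0.8}))
```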
Automatic License Plate Recognition is a frequent research topic due to its wide-ranging practical applications. While recent studies use synthetic images to improve License Plate Recognition (LPR) results, there remain several limitations in these efforts. This work addresses these constraints by comprehensively exploring the integration of real and synthetic data to enhance LPR performance. We subject 16 Optical Character Recognition (OCR) models to a benchmarking process involving 12 public datasets acquired from various regions. Several key findings emerge from our investigation. Primarily, the massive incorporation of synthetic data substantially boosts model performance in both intra- and cross-dataset scenarios. We examine three distinct methodologies for generating synthetic data: template-based generation, character permutation, and utilizing a Generative Adversarial Network (GAN) model, each contributing significantly to performance enhancement. The combined use of these methodologies demonstrates a notable synergistic effect, leading to end-to-end results that surpass those reached by state-of-the-art methods and established commercial systems. Our experiments also underscore the efficacy of synthetic data in mitigating challenges posed by limited training data, enabling remarkable results to be achieved even with small fractions of the original training data. Finally, we investigate the trade-off between accuracy and speed among different models, identifying those that strike the optimal balance in both intra-dataset and cross-dataset settings.
https://arxiv.org/abs/2601.07671
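Of the three generation methodologies named above, character permutation is the simplest to illustrate. A minimal sketch, under the assumption that new labels should preserve the letter/digit layout of a seed plate so the rendered image keeps a plausible regional format:

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"

def permute_plate(plate: str, n_swaps: int = 2, rng=None) -> str:
    """Create a new synthetic label by replacing characters of an
    existing plate, letters with letters and digits with digits."""
    rng = rng or random.Random(0)
    chars = list(plate)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars))
        pool = DIGITS if chars[i].isdigit() else ALPHABET
        chars[i] = rng.choice(pool)
    return "".join(chars)

# Same 3-letter/4-digit layout as the seed, different characters.
print(permute_plate("ABC1234"))
```

Each permuted label would then be rendered onto a plate template to form a new training image.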
Document layout analysis aims to detect and categorize structural elements (e.g., titles, tables, figures) in scanned or digital documents. Popular methods often rely on high-quality Optical Character Recognition (OCR) to merge visual features with extracted text. This dependency introduces two major drawbacks: propagation of text recognition errors and substantial computational overhead, limiting the robustness and practical applicability of multimodal approaches. In contrast to the prevailing multimodal trend, we argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure. To this end, we propose PARL (Position-Aware Relation Learning Network), a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure. Specifically, we first introduce a Bidirectional Spatial Position-Guided Deformable Attention module to embed explicit positional dependencies among layout elements directly into visual features. Second, we design a Graph Refinement Classifier (GRC) to refine predictions by modeling contextual relationships through a dynamically constructed layout graph. Extensive experiments show PARL achieves state-of-the-art results. It establishes a new benchmark for vision-only methods on DocLayNet and, notably, surpasses even strong multimodal models on M6Doc. Crucially, PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models (256M), demonstrating that sophisticated visual structure modeling can be both more efficient and robust than multimodal fusion.
https://arxiv.org/abs/2601.07620
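The dynamically constructed layout graph that feeds PARL's Graph Refinement Classifier can be pictured with a small sketch. The k-nearest-neighbour construction over box centres below is an illustrative assumption, not the paper's exact graph-building rule:

```python
import math

def layout_graph(boxes, k=2):
    """Connect each layout element to its k nearest neighbours,
    measured between (x0, y0, x1, y1) box centres."""
    centres = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
    edges = set()
    for i, (cx, cy) in enumerate(centres):
        dists = sorted(
            (math.hypot(cx - ox, cy - oy), j)
            for j, (ox, oy) in enumerate(centres) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge
    return edges

# title, paragraph, and caption laid out top to bottom
boxes = [(0, 0, 100, 20), (0, 30, 100, 120), (0, 130, 100, 150)]
print(layout_graph(boxes, k=1))
```

A classifier can then propagate context along these edges, e.g. a short box directly below a figure is more likely a caption than a title.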
Bahnar, a minority language spoken across Vietnam, Cambodia, and Laos, faces significant preservation challenges due to limited research and data availability. This study addresses the critical need for accurate digitization of Bahnar language documents through optical character recognition (OCR) technology. Digitizing scanned paper documents poses significant challenges, as degraded image quality from broken or blurred areas introduces considerable OCR errors that compromise information retrieval systems. We propose a comprehensive approach combining advanced table and non-table detection techniques with probability-based post-processing heuristics to enhance recognition accuracy. Our method first applies detection algorithms to improve input data quality, then employs probabilistic error correction on OCR output. Experimental results indicate a substantial improvement, with recognition accuracy increasing from 72.86% to 79.26%. This work contributes valuable resources for Bahnar language preservation and provides a framework applicable to other minority language digitization efforts.
https://arxiv.org/abs/2601.02965
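The probability-based post-processing step can be sketched as choosing, among lexicon words within an edit-distance budget of the OCR output, the candidate with the highest unigram probability. The lexicon, probabilities, and budget below are illustrative assumptions:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def correct(token, lexicon, max_dist=2):
    """Replace an OCR token by the in-lexicon candidate that maximizes
    unigram probability among candidates within the distance budget."""
    candidates = [(p, w) for w, p in lexicon.items()
                  if edit_distance(token, w) <= max_dist]
    return max(candidates)[1] if candidates else token

lexicon = {"bahnar": 0.05, "banner": 0.01}  # hypothetical frequencies
print(correct("bahnnar", lexicon))
```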
This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language utilizing a modified Perso-Arabic writing system spoken by approximately seven million people. Each image is rendered at 256x64 pixels with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and general-purpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.
https://arxiv.org/abs/2601.01088
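Consuming such a dataset in a CRNN-style pipeline typically amounts to reading a labels file into (image path, transcription) pairs. The CSV column names below are assumptions for illustration; consult the dataset card on HuggingFace for the actual schema of the 600K-KS-OCR archives.

```python
import csv
import io

def load_labels(fp):
    """Read a labels CSV into (image_path, text) pairs.

    Assumes hypothetical columns 'image' and 'text'; adjust to the
    dataset's real schema.
    """
    reader = csv.DictReader(fp)
    return [(row["image"], row["text"]) for row in reader]

# In-memory stand-in for one shard's labels file.
demo = io.StringIO("image,text\nshard0/000001.png,کٲشُر\n")
pairs = load_labels(demo)
print(pairs)
```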
Handwritten text recognition and optical character recognition solutions show excellent results when processing modern-era data, but their efficiency drops on medieval Latin documents. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and the related works and studies. The paper presents the steps of dataset development for further training of the models. The exploratory data analysis of the processed data is provided as well. The paper explains the pipeline of deep learning models to extract text information from the document images, from detecting objects to word recognition using classification models and embedding word images. The paper reports the following results: recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance. The plots of the metrics are also included. The implementation is published in a GitHub repository.
https://arxiv.org/abs/2512.18865
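Two of the reported metrics, intersection over union for the detection stage and mean string distance for the recognition stage, have standard definitions and can be sketched directly (the boxes and strings below are illustrative):

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def mean_string_distance(preds, refs):
    """Average Levenshtein distance between predictions and references."""
    def lev(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]
    return sum(lev(p, r) for p, r in zip(preds, refs)) / len(preds)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))       # 25 / 175
print(mean_string_distance(["deus"], ["dcus"]))  # one substitution
```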
Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult: it involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks: they need all-in-focus inputs, depend on synthetic data from simulators, and offer limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training, which combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.
https://arxiv.org/abs/2512.16923
Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations: None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval (ColPali), using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
https://arxiv.org/abs/2512.16802
The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that "Perception Errors" were dominant, highlighting a "Perception-Cognition Gap" where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a "Calculation-Conceptualization Discrepancy," successfully performing calculations while failing to apply the underlying scientific concepts, and "Process Hallucination," where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing "AI-resistant questions" that target these specific cognitive vulnerabilities. By exploiting AI's weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
https://arxiv.org/abs/2512.15298
Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model's performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at this https URL.
https://arxiv.org/abs/2512.14114
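The Haar wavelet preprocessing step can be illustrated with a single-level 2D decomposition: each 2x2 pixel block yields one approximation (LL) coefficient and three detail coefficients (LH, HL, HH). The averaging normalization below is an illustrative choice; MFE-GAN's exact transform and normalization may differ.

```python
def haar2d(img):
    """One level of a 2D Haar transform on an H x W image (H, W even),
    returning approximation and horizontal/vertical/diagonal details."""
    H, W = len(img), len(img[0])
    LL, LH, HL, HH = ([[0.0] * (W // 2) for _ in range(H // 2)]
                      for _ in range(4))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            LL[i // 2][j // 2] = (a + b + c + d) / 4  # average
            LH[i // 2][j // 2] = (a - b + c - d) / 4  # horizontal detail
            HL[i // 2][j // 2] = (a + b - c - d) / 4  # vertical detail
            HH[i // 2][j // 2] = (a - b - c + d) / 4  # diagonal detail
    return LL, LH, HL, HH

# Vertical stripes (like thin text strokes) show up in the LH band.
img = [[10, 0, 10, 0]] * 4
LL, LH, HL, HH = haar2d(img)
print(LL, LH)
```

Feeding the subbands to the GAN separates coarse shading (LL) from stroke-level detail (LH/HL/HH), which is what makes this a natural front end for binarization.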
With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations and the performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool-Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as key-element detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIR's transparency and verifiability, ChartAgent moves beyond the black-box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.
https://arxiv.org/abs/2512.14040
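The orchestration-plus-evidence pattern can be sketched as a loop that calls tools in sequence and records each raw output. The tool names mirror the abstract; their bodies here are stubs with canned outputs, purely for illustration of the Evidence Package idea.

```python
def detect_key_elements(chart):
    # Stub: a real tool would return detected axes, legends, marks, etc.
    return {"axes": ["year", "sales"], "marks": 3}

def run_ocr(chart):
    # Stub: a real tool would read tick and data labels from pixels.
    return {"labels": ["2021", "2022", "2023"]}

def chart_agent(chart):
    """Run tools in order, consolidating outputs into an evidence list
    that makes the final answer traceable and replayable."""
    evidence = []
    for name, tool in [("key_element_detection", detect_key_elements),
                       ("ocr", run_ocr)]:
        evidence.append({"tool": name, "output": tool(chart)})
    marks = evidence[0]["output"]["marks"]
    return {"answer": f"{marks} data points", "evidence": evidence}

result = chart_agent("bar_chart.png")
print(result["answer"])
```

Because every intermediate output is kept, a reviewer can replay the evidence list step by step instead of trusting an opaque end-to-end answer.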
Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
https://arxiv.org/abs/2512.11635
Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
https://arxiv.org/abs/2512.08789
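The decompose-and-amplify idea behind the high-frequency amplification module can be illustrated in one dimension: subtract a low-pass version of the signal to isolate the high-frequency residual, then add it back with gain. This unsharp-mask-style sketch is only an analogy for HFAM, whose actual decomposition is learned and adaptive.

```python
def box_blur_1d(sig):
    """3-tap box blur with edge clamping (a simple low-pass filter)."""
    n = len(sig)
    return [(sig[max(i - 1, 0)] + sig[i] + sig[min(i + 1, n - 1)]) / 3
            for i in range(n)]

def amplify_high_freq(sig, gain=2.0):
    """Split a signal into low- and high-frequency parts and amplify
    the high-frequency residual before recombining."""
    low = box_blur_1d(sig)
    return [l + gain * (s - l) for s, l in zip(sig, low)]

edge = [0, 0, 0, 10, 10, 10]  # a step edge, like a text stroke boundary
print(amplify_high_freq(edge))
```

The output overshoots on both sides of the step, i.e. the edge is sharpened, which is the effect shadow removal must preserve around text strokes.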
Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at this https URL.
https://arxiv.org/abs/2512.07580
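Both ingredients above are simple to sketch: a token-salience score based on how much the output distribution changes when the token is removed (the L1 distance here is an illustrative choice, not necessarily the paper's exact metric), and order-preserving random pruning at a 50% keep ratio.

```python
import random

def token_information(p_full, p_without):
    """Salience of a visual token: change in the model's output
    probabilities when the token is removed (illustrative L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p_full, p_without))

def random_prune(tokens, keep_ratio=0.5, rng=None):
    """Randomly keep a fraction of visual tokens, preserving order --
    the strategy that matches trained methods beyond the horizon."""
    rng = rng or random.Random(0)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(rng.sample(range(len(tokens)), k))
    return [tokens[i] for i in keep]

print(token_information([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))  # ~0.4
tokens = [f"v{i}" for i in range(8)]
print(random_prune(tokens))  # 4 tokens, original order kept
```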
Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, particularly struggling with dense scenes and small object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on RefCOCO series and 19% on Flickr30k entities.
https://arxiv.org/abs/2512.06663
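The classify-count-ground decomposition can be sketched as three sequential model queries whose outputs feed one another. The prompt wording is illustrative, and `ask` stands in for a call to an LVLM such as Qwen2.5-VL; it is stubbed here with canned answers.

```python
def ask(prompt):
    """Stub LVLM call: returns canned answers keyed by step name."""
    canned = {
        "classify": "cat, dog",
        "count": {"cat": 2, "dog": 1},
        "ground": {"cat": [[10, 10, 50, 60], [70, 15, 110, 70]],
                   "dog": [[120, 30, 200, 90]]},
    }
    return canned[prompt.split(":")[0]]

def cot4det(image):
    """Three interpretable steps instead of one dense detection query."""
    classes = ask("classify: which object categories appear?")
    counts = ask("count: how many instances of each category?")
    boxes = ask("ground: give one box per counted instance")
    return [(cls, box)
            for cls in classes.split(", ")
            for box in boxes[cls][:counts[cls]]]

detections = cot4det("street.jpg")
print(len(detections))  # one (class, box) pair per counted instance
```

Each step matches a reasoning skill LVLMs already have, which is why the decomposition recovers recall in dense scenes where a single-shot detection prompt fails.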
The rise of Large Vision-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid- and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: this https URL
https://arxiv.org/abs/2512.03558
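A sample in a benchmark like this can be modeled as a small record plus a scoring loop. The field names and the exact-match metric below are assumptions for illustration, not CartoMapQA's released schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MapQASample:
    map_image: str               # path to the cartographic map image
    question: str
    choices: Optional[List[str]] # None for open-ended questions
    answer: str                  # ground-truth answer

def exact_match_accuracy(samples, predict):
    """Score a model callable `predict(sample) -> str` by case-insensitive
    exact match against the ground-truth answer."""
    hits = sum(predict(s).strip().lower() == s.answer.strip().lower()
               for s in samples)
    return hits / len(samples)
```

Open-ended answers would in practice need a softer matching rule than exact string equality; the abstract does not specify which metric the benchmark uses.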
Despite considerable progress in handwritten text recognition, paragraph-level handwritten text recognition, especially in low-resource languages such as Hindi, Urdu, and similar scripts, remains a challenging problem. These languages, often lacking comprehensive linguistic resources, require special attention to develop robust systems for accurate optical character recognition (OCR). This paper introduces BharatOCR, a novel segmentation-free system for paragraph-level handwritten Hindi and Urdu text recognition. We propose a ViT-Transformer-decoder-LM architecture, where a Vision Transformer (ViT) extracts visual features, a Transformer decoder generates text sequences, and a pre-trained language model (LM) refines the output to improve accuracy, fluency, and coherence. Our model utilizes a Data-efficient Image Transformer (DeiT) adapted in this work for masked image modeling. In addition, we adopt a RoBERTa architecture optimized for masked language modeling (MLM) to enhance the linguistic comprehension and generative capabilities of the proposed model. The Transformer decoder generates text sequences from visual embeddings, and the model iteratively processes a paragraph image line by line, a procedure we call implicit line segmentation. The proposed model was evaluated on our custom datasets, Parimal Urdu and Parimal Hindi, introduced in this work, as well as on two public datasets. It achieved benchmark results on the NUST-UHWR, PUCIT-OUHL, and Parimal-Urdu datasets, with character recognition rates of 96.24%, 92.05%, and 94.80%, respectively, and a character recognition rate of 80.64% on the Hindi dataset. These results indicate that the proposed model outperforms several state-of-the-art Urdu text recognition methods.
https://arxiv.org/abs/2512.01348
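The implicit line segmentation loop can be sketched as follows. The encoder, decoder, and LM are stand-in callables here, not the paper's DeiT, Transformer decoder, or RoBERTa components; the end-of-paragraph signal (`None`) is likewise an assumption.

```python
def recognize_paragraph(paragraph_image, encode, decode_next_line, lm_refine,
                        max_lines=50):
    """Encode the whole paragraph image once, then let the decoder emit one
    text line per iteration (implicit line segmentation) until it signals
    the end of the paragraph by returning None."""
    feats = encode(paragraph_image)            # visual features (ViT stand-in)
    lines = []
    for _ in range(max_lines):
        line = decode_next_line(feats, lines)  # conditioned on lines so far
        if line is None:                       # end-of-paragraph (assumption)
            break
        lines.append(lm_refine(line))          # LM polishes fluency/coherence
    return "\n".join(lines)
```

The key point is that no explicit line-segmentation step precedes recognition: the decoder itself advances through the paragraph, which is what makes the system segmentation-free.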
The point spread function (PSF) serves as a fundamental descriptor linking the real-world scene to the captured signal, manifesting as camera blur. Accurate PSF estimation is crucial for both optical characterization and computational vision, yet remains challenging due to the inherent ambiguity and the ill-posed nature of intensity-based deconvolution. We introduce CircleFlow, a high-fidelity PSF estimation framework that employs flow-guided edge localization for precise blur characterization. CircleFlow begins with a structured capture that encodes locally anisotropic and spatially varying PSFs by imaging a circle-grid target, while leveraging the target's binary luminance as a prior to decouple image and kernel estimation. The latent sharp image is then reconstructed through subpixel alignment of an initialized binary structure guided by optical flow, while the PSF is modeled as an energy-constrained implicit neural representation. Both components are jointly optimized within a demosaicing-aware differentiable framework, ensuring physically consistent and robust PSF estimation enabled by accurate edge localization. Extensive experiments on simulated and real-world data demonstrate that CircleFlow achieves state-of-the-art accuracy and reliability, validating its effectiveness for practical PSF calibration.
https://arxiv.org/abs/2512.00796
Business Process Model and Notation (BPMN) is a widely adopted standard for representing complex business workflows. While BPMN diagrams are often exchanged as visual images, existing methods primarily rely on XML representations for computational analysis. In this work, we present a pipeline that leverages Vision-Language Models (VLMs) to extract structured JSON representations of BPMN diagrams directly from images, without requiring source model files or textual annotations. We also incorporate optical character recognition (OCR) for textual enrichment and evaluate the generated element lists against ground truth data derived from the source XML files. Our approach enables robust component extraction in scenarios where original source files are unavailable. We benchmark multiple VLMs and observe performance improvements in several models when OCR is used for text enrichment. In addition, we conducted extensive statistical analyses of OCR-based enrichment methods and prompt ablation studies, providing a clearer understanding of their impact on model performance.
https://arxiv.org/abs/2511.22448
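The OCR-enrichment and evaluation steps can be sketched as below. The element fields (`type`, `label`), the fill-in enrichment rule, and the set-based F1 matching are illustrative assumptions; the paper's exact JSON schema and metric are not given in the abstract.

```python
def enrich_with_ocr(elements, ocr_tokens):
    """Attach OCR-recognized text to elements whose VLM-predicted label is
    missing, consuming OCR tokens in order (a simplifying assumption)."""
    it = iter(ocr_tokens)
    for el in elements:
        if not el.get("label"):
            el["label"] = next(it, "")
    return elements

def element_f1(predicted, ground_truth):
    """Set-based F1 over (type, label) pairs between a predicted element
    list and one derived from the source BPMN XML."""
    pred = {(e["type"], e["label"]) for e in predicted}
    gold = {(e["type"], e["label"]) for e in ground_truth}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

In a real pipeline the VLM would return the element list as JSON parsed from its reply, and OCR boxes would be matched to elements spatially rather than in reading order.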