Single-image super-resolution refers to the reconstruction of a high-resolution image from a single low-resolution observation. Although recent deep learning-based methods have demonstrated notable success on simulated datasets -- with low-resolution images obtained by degrading and downsampling high-resolution ones -- they frequently fail to generalize to real-world settings, such as document scans, which are affected by complex degradations and semantic variability. In this study, we introduce a task-driven, multi-task learning framework for training a super-resolution network specifically optimized for optical character recognition tasks. We propose to incorporate auxiliary loss functions derived from high-level vision tasks, including text detection using the connectionist text proposal network, text recognition via a convolutional recurrent neural network, keypoint localization using this http URL, and hue consistency. To balance these diverse objectives, we employ a dynamic weight averaging mechanism, which adaptively adjusts the relative importance of each loss term based on its convergence behavior. We validate our approach on the SRResNet architecture, a well-established model for single-image super-resolution. Experimental evaluations on both simulated and real-world scanned document datasets demonstrate that the proposed approach improves text detection, measured with intersection over union, while preserving overall image fidelity. These findings underscore the value of multi-objective optimization in super-resolution models for bridging the gap between simulated training regimes and practical deployment in real-world scenarios.
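As a concrete reference, dynamic weight averaging is usually implemented with the softmax-over-descent-rate rule of Liu et al.; the sketch below follows that common formulation, with the temperature and loss names as illustrative assumptions rather than the exact settings used in this paper.

```python
import numpy as np

def dwa_weights(loss_history, temperature=2.0):
    """Dynamic weight averaging: weight each task by its recent descent rate,
    softmax-normalized with a temperature.

    loss_history: list of per-epoch loss vectors, one value per task,
                  e.g. [[L_sr, L_det, L_rec, L_kpt, L_hue], ...]
    """
    K = len(loss_history[-1])
    if len(loss_history) < 2:
        return np.ones(K)                       # equal weights for the first epochs
    prev = np.asarray(loss_history[-1], dtype=float)
    prev2 = np.asarray(loss_history[-2], dtype=float)
    r = prev / np.maximum(prev2, 1e-8)          # relative convergence rate per task
    exp_r = np.exp(r / temperature)
    return K * exp_r / exp_r.sum()              # weights sum to K, as in DWA

# usage: total_loss = sum(w * L for w, L in zip(dwa_weights(history), current_losses))
```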
https://arxiv.org/abs/2506.06953
Current smart glasses equipped with RGB cameras struggle to perceive the environment in low-light and high-speed motion scenarios due to motion blur and the limited dynamic range of frame cameras. Additionally, capturing dense images with a frame camera requires large bandwidth and power consumption, consequently draining the battery faster. These challenges are especially relevant for developing algorithms that can read text from images. In this work, we propose a novel event-based Optical Character Recognition (OCR) approach for smart glasses. By using the eye gaze of the user, we foveate the event stream to significantly reduce bandwidth by around 98% while exploiting the benefits of event cameras in high-dynamic-range and fast scenes. Our proposed method performs deep binary reconstruction trained on synthetic data and leverages multimodal LLMs for OCR, outperforming traditional OCR solutions. Our results demonstrate the ability to read text in low-light environments where RGB cameras struggle while using up to 2400 times less bandwidth than a wearable RGB camera.
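A minimal sketch of what gaze-driven foveation of an event stream can look like: only events inside a window around the current gaze point are kept. The (x, y, t, p) array layout and window size are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def foveate_events(events, gaze_xy, radius=64):
    """Keep only events inside a square window centred on the gaze point.

    events : (N, 4) array with columns (x, y, timestamp, polarity) -- assumed layout
    gaze_xy: (x, y) gaze position in pixel coordinates
    """
    gx, gy = gaze_xy
    keep = (np.abs(events[:, 0] - gx) <= radius) & (np.abs(events[:, 1] - gy) <= radius)
    return events[keep]

# e.g. a 128x128 crop of a 1280x720 sensor keeps roughly 2% of the pixel area,
# which is the kind of reduction the ~98% bandwidth figure refers to.
events = np.random.rand(10000, 4) * [1280, 720, 1.0, 1]
print(foveate_events(events, gaze_xy=(640, 360)).shape)
```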
https://arxiv.org/abs/2506.06918
Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show "Who Wants to Be a Millionaire?" (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available at Hugging Face.
https://arxiv.org/abs/2506.05991
Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes, not only in the field of OCR but also in the broader domain of pattern recognition. The dataset is available at this https URL.
https://arxiv.org/abs/2506.04807
This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they use a straightforward image-label compositor as the prompt and query input, and then mask the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at this https URL.
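A rough sketch of a task-chaining compositor in the image-removal-segmentation order described above: the demonstration row shows an image with its removal and segmentation results, and the query row leaves those two cells masked for the model to fill in. The grid layout and gray masking are assumptions for illustration, not the exact ConText composition.

```python
import numpy as np

def task_chain_compositor(demo_img, demo_removal, demo_seg, query_img):
    """Compose a single prompt canvas in the image-removal-segmentation order.

    Top row:    demonstration image, its text-removal result, its text mask.
    Bottom row: query image followed by two masked (gray) cells to be predicted.
    All inputs are HxWx3 uint8 arrays of the same size.
    """
    h, w, _ = query_img.shape
    masked = np.full((h, w, 3), 127, dtype=np.uint8)   # masked query labels
    top = np.hstack([demo_img, demo_removal, demo_seg])
    bottom = np.hstack([query_img, masked, masked])
    return np.vstack([top, bottom])
```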
https://arxiv.org/abs/2506.03799
Visually impaired individuals face significant challenges navigating and interacting with unknown situations, particularly in tasks requiring spatial awareness and semantic scene understanding. To accelerate the development and evaluate the state of technologies that enable visually impaired people to solve these tasks, the Vision Assistance Race (VIS) at the Cybathlon 2024 competition was organized. In this work, we present Sight Guide, a wearable assistive system designed for the VIS. The system processes data from multiple RGB and depth cameras on an embedded computer that guides the user through complex, real-world-inspired tasks using vibration signals and audio commands. Our software architecture integrates classical robotics algorithms with learning-based approaches to enable capabilities such as obstacle avoidance, object detection, optical character recognition, and touchscreen interaction. In a testing environment, Sight Guide achieved a 95.7% task success rate, and further demonstrated its effectiveness during the Cybathlon competition. This work provides detailed insights into the system design, evaluation results, and lessons learned, and outlines directions towards a broader real-world applicability.
https://arxiv.org/abs/2506.02676
The inherent complexities of Arabic script, including its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
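For readers comparing the reported numbers, WER and CER are normalized Levenshtein distances computed over words and characters respectively; a minimal reference implementation (not the authors' evaluation code):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```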
https://arxiv.org/abs/2506.02295
Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.
https://arxiv.org/abs/2506.01775
Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats. However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts. Existing Arabic OCR datasets often focus on isolated words or lines or are limited in scale, typographic variety, or structural complexity found in books. To address this significant gap, we introduce SARD (Large-Scale Synthetic Arabic OCR Dataset). SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training. Its synthetic nature provides unparalleled scalability and allows for precise control over layout and content variation. We detail the dataset's composition and generation process and provide benchmark results for several OCR models, including traditional and deep learning approaches, highlighting the challenges and opportunities presented by this dataset. SARD serves as a valuable resource for developing and evaluating robust OCR and vision-language models capable of processing diverse Arabic book-style texts.
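A hedged sketch of the kind of clean, book-style page rendering such a synthetic dataset relies on, using Pillow; the font path is a placeholder for one of the ten fonts, and proper Arabic output would additionally need right-to-left shaping, which is omitted to keep the sketch short.

```python
from PIL import Image, ImageDraw, ImageFont

def render_page(text, font_path, font_size=28, size=(1240, 1754), margin=80):
    """Render a block of text onto a clean white page image (no scan noise)."""
    page = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(font_path, font_size)
    max_width = size[0] - 2 * margin

    lines, line = [], ""
    for word in text.split():                          # naive greedy line wrapping
        candidate = (line + " " + word).strip()
        if draw.textlength(candidate, font=font) > max_width and line:
            lines.append(line)
            line = word
        else:
            line = candidate
    lines.append(line)

    for i, ln in enumerate(lines):
        draw.text((margin, margin + i * int(font_size * 1.6)), ln, font=font, fill="black")
    return page
```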
https://arxiv.org/abs/2505.24600
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: this https URL
https://arxiv.org/abs/2505.23566
While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
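A sketch of how a UTF-8 based text encoder for cross-attention might look: the extracted characters are encoded as byte tokens over a 256-entry vocabulary, so every script shares one small embedding table. Dimensions, sequence length, and padding are assumptions, not the actual TextSR architecture.

```python
import torch
import torch.nn as nn

class Utf8TextEncoder(nn.Module):
    """Embed OCR'd text as a sequence of UTF-8 byte tokens; the resulting
    sequence can be attended to by a diffusion backbone via cross-attention."""

    def __init__(self, dim=256, max_bytes=64):
        super().__init__()
        self.max_bytes = max_bytes
        self.byte_emb = nn.Embedding(257, dim, padding_idx=256)  # 256 byte values + pad
        self.pos_emb = nn.Embedding(max_bytes, dim)

    def forward(self, texts):
        batch = []
        for t in texts:
            b = list(t.encode("utf-8"))[: self.max_bytes]
            b += [256] * (self.max_bytes - len(b))               # pad to fixed length
            batch.append(b)
        ids = torch.tensor(batch)
        pos = torch.arange(self.max_bytes).unsqueeze(0)
        return self.byte_emb(ids) + self.pos_emb(pos)

# enc = Utf8TextEncoder(); enc(["Café", "北京"]).shape  -> (2, 64, 256)
```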
https://arxiv.org/abs/2505.23119
Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%. Fine-grained analysis indicates that while existing MLLMs demonstrate strong performance on tasks where relevant texts are contained within a single or few frames, they exhibit limited capability in effectively handling tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.
https://arxiv.org/abs/2505.21333
Text Image Machine Translation (TIMT), the task of translating textual content embedded in images, is critical for applications in accessibility, cross-lingual information access, and real-world document understanding. However, TIMT remains a complex challenge due to the need for accurate optical character recognition (OCR), robust visual-text reasoning, and high-quality translation, often requiring cascading multi-stage pipelines. Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT$^{3}$, the first framework to apply Multi-Task RL to MLLMs for end-to-end TIMT. MT$^{3}$ adopts a multi-task optimization paradigm targeting three key sub-skills: text recognition, context-aware reasoning, and translation. It is trained using a novel multi-mixed reward mechanism that adapts rule-based RL strategies to TIMT's intricacies, offering fine-grained, non-binary feedback across tasks. Furthermore, to facilitate the evaluation of TIMT in authentic cross-cultural and real-world social media contexts, we introduce XHSPost, the first social media TIMT benchmark. Our MT$^{3}$-7B-Zero achieves state-of-the-art results on the latest in-domain MIT-10M benchmark, outperforming strong baselines such as Qwen2.5-VL-72B and InternVL2.5-78B by notable margins across multiple metrics. Additionally, the model shows strong generalization to out-of-distribution language pairs and datasets. In-depth analyses reveal how multi-task synergy, reinforcement learning initialization, curriculum design, and reward formulation contribute to advancing MLLM-driven TIMT.
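To make the idea of fine-grained, non-binary feedback concrete, the sketch below scores the recognition and translation sub-skills with normalized string similarity and mixes them with fixed weights. The components and weights are illustrative assumptions, not the paper's actual reward formulation.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Non-binary similarity in [0, 1] based on matching character ratio."""
    return SequenceMatcher(None, a, b).ratio()

def timt_reward(pred_ocr, gold_ocr, pred_translation, gold_translation,
                w_rec=0.4, w_trans=0.6):
    """Illustrative multi-mixed reward: partial credit for both the recognition
    and translation sub-skills instead of a single 0/1 success signal."""
    r_rec = text_similarity(pred_ocr, gold_ocr)
    r_trans = text_similarity(pred_translation, gold_translation)
    return w_rec * r_rec + w_trans * r_trans

print(timt_reward("Happy birthda", "Happy birthday", "生日快乐", "生日快乐!"))
```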
https://arxiv.org/abs/2505.19714
Document alignment and registration play a crucial role in numerous real-world applications, such as automated form processing, anomaly detection, and workflow automation. Traditional methods for document alignment rely on image-based features like keypoints, edges, and textures to estimate geometric transformations, such as homographies. However, these approaches often require access to the original document images, which may not always be available due to privacy, storage, or transmission constraints. This paper introduces a novel approach that leverages Optical Character Recognition (OCR) outputs as features for homography estimation. By utilizing the spatial positions and textual content of OCR-detected words, our method enables document alignment without relying on pixel-level image data. This technique is particularly valuable in scenarios where only OCR outputs are accessible. Furthermore, the method is robust to OCR noise, incorporating RANSAC to handle outliers and inaccuracies in the OCR data. On a set of test documents, we demonstrate that our OCR-based approach even performs more accurately than traditional image-based methods, offering a more efficient and scalable solution for document registration tasks. The proposed method facilitates applications in document processing, all while reducing reliance on high-dimensional image data.
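A hedged sketch of the core idea: match OCR'd words between the two documents by their text content, then estimate a homography from the matched word centres, with RANSAC rejecting bad matches and noisy box positions. The OCR output format and the naive string-matching rule are assumptions; OpenCV's findHomography does the robust fitting.

```python
import cv2
import numpy as np

def homography_from_ocr(words_a, words_b, ransac_thresh=5.0):
    """Estimate the homography mapping document A onto document B.

    words_a / words_b: lists of (text, (x, y)) pairs, where (x, y) is the
    centre of the OCR-detected word -- an assumed output format.
    """
    index_b = {}
    for text, xy in words_b:
        index_b.setdefault(text, []).append(xy)

    pts_a, pts_b = [], []
    for text, xy in words_a:
        if index_b.get(text):                      # naive matching: identical strings
            pts_a.append(xy)
            pts_b.append(index_b[text].pop(0))

    if len(pts_a) < 4:
        raise ValueError("need at least 4 matched words to fit a homography")
    H, inlier_mask = cv2.findHomography(np.float32(pts_a), np.float32(pts_b),
                                        cv2.RANSAC, ransac_thresh)
    return H, inlier_mask
```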
https://arxiv.org/abs/2505.18925
Despite significant advancements in Large Vision Language Models (LVLMs), a gap remains, particularly regarding their interpretability and how they locate and interpret textual information within images. In this paper, we explore various LVLMs to identify the specific heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. (2) Qualitatively Distinct: OCR heads possess properties that differ significantly from general retrieval heads, exhibiting low similarity in their characteristics. (3) Statically Activated: The frequency of activation for these heads closely aligns with their OCR scores. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads. We also demonstrate that redistributing sink-token values within the OCR heads improves performance. These insights provide a deeper understanding of the internal mechanisms LVLMs employ in processing embedded textual information in images.
https://arxiv.org/abs/2505.15865
This paper introduces a comprehensive end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers. In our approach, we address the unique challenges of complex multi-column layouts, low-resolution archival scans, and diverse font styles. Our process decomposes the OCR task into four key modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. For article segmentation, we fine-tune and evaluate YOLOv11x to identify and separate individual articles from cluttered layouts. Our model achieves a precision of 0.963 and mAP@50 of 0.975. For super-resolution, we fine-tune and benchmark the SwinIR model (reaching 32.71 dB PSNR) to enhance the quality of degraded newspaper scans. To do our column segmentation, we use YOLOv11x to separate columns in text to further enhance performance - this model reaches a precision of 0.970 and mAP@50 of 0.975. In the text recognition stage, we benchmark a range of LLMs from different families, including Gemini, GPT, Llama, and Claude. The lowest WER of 0.133 is achieved by Gemini-2.5-Pro.
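A sketch of the article-segmentation stage using the ultralytics API; the checkpoint name stands in for the fine-tuned YOLOv11x weights and is a placeholder, not a released file.

```python
from ultralytics import YOLO
from PIL import Image

def segment_articles(page_path, weights="yolo11x-articles.pt", conf=0.5):
    """Detect article regions on a newspaper page and return cropped images."""
    model = YOLO(weights)
    result = model(page_path, conf=conf)[0]
    page = Image.open(page_path)
    crops = []
    for box in result.boxes.xyxy.tolist():         # [x1, y1, x2, y2] per detection
        crops.append(page.crop(tuple(map(int, box))))
    return crops

# each crop would then pass through super-resolution, column segmentation,
# and finally an LLM-based recognizer, as described above.
```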
https://arxiv.org/abs/2505.13943
Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs' abilities in relatively simple visual question answering, visual-text parsing, etc. However, the extent to which LMMs can deal with complex logical reasoning problems based on OCR cues is relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges, underscoring the urgent need to improve reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at this https URL.
https://arxiv.org/abs/2505.12766
Recent advances in Large Multimodal Models (LMMs) have significantly improved their reasoning and Optical Character Recognition (OCR) capabilities. However, their performance on complex logical reasoning tasks involving text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs' logical reasoning abilities on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by curating a text corpus from the Chinese National Civil Servant Examination and develop a scalable, automated pipeline to convert it into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. We hope LogicOCR will serve as a valuable resource for advancing multimodal reasoning research. The dataset is available at this https URL.
https://arxiv.org/abs/2505.12307
This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then passes it through a pipeline of large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (regex) for better document comprehension. Made available through an accessible Gradio interface, the system demonstrates a real-world application of libraries, models, and APIs that closes the language gap and enhances access to information in image media across different linguistic environments.
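A minimal sketch of the OCR front end and Gradio wrapper under stated assumptions: the Gemini translation and summarization calls are replaced by a placeholder function, and Tesseract is assumed to have the eng, hin, and tam language packs installed.

```python
import pytesseract
import gradio as gr

def summarize_and_translate(text, target_lang):
    # placeholder for the Gemini API calls (translation, abstractive summary,
    # re-translation into the target language); swap in the real client here
    return f"[{target_lang} summary of {len(text.split())} extracted words]"

def process(image, target_lang):
    # Tesseract with the English + Hindi + Tamil traineddata files installed
    text = pytesseract.image_to_string(image, lang="eng+hin+tam")
    return text, summarize_and_translate(text, target_lang)

demo = gr.Interface(
    fn=process,
    inputs=[gr.Image(type="pil"),
            gr.Dropdown(["English", "Hindi", "Tamil"], label="Target language")],
    outputs=[gr.Textbox(label="Extracted text"), gr.Textbox(label="Summary")],
)

if __name__ == "__main__":
    demo.launch()
```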
https://arxiv.org/abs/2505.11177
This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at this https URL.
https://arxiv.org/abs/2505.10055