Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on DIMT datasets often result in the forgetting of the model's existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR), in which the model reviews its own OCR proficiency while learning to translate, inspired by the concept of the "Bilingual Cognitive Advantage". Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate that the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.
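For concreteness, here is a minimal sketch of how an SSR-style supervised fine-tuning sample could be assembled, with the OCR transcription placed before the translation in the target; the prompt wording, tag format, and field names are illustrative assumptions rather than the paper's exact scheme.

```python
# Hypothetical SSR-style SFT sample: the target asks the model to transcribe the
# source text (OCR) before emitting the translation, so the monolingual OCR skill
# is rehearsed on every DIMT example. Tags and prompt wording are assumed.

def build_ssr_sample(image_path: str, ocr_text: str, translation: str) -> dict:
    prompt = (
        "Read the document image. First transcribe the source-language text, "
        "then translate it into the target language."
    )
    target = (
        f"<ocr>\n{ocr_text}\n</ocr>\n"
        f"<translation>\n{translation}\n</translation>"
    )
    return {"image": image_path, "prompt": prompt, "target": target}


print(build_ssr_sample("doc_0001.png", "Hello, world.", "Hallo, Welt.")["target"])
```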
https://arxiv.org/abs/2507.08309
Text recognition is significantly influenced by font types, especially for complex scripts like Khmer. The variety of Khmer fonts, each with its unique character structure, presents challenges for optical character recognition (OCR) systems. In this study, we evaluate the impact of 19 randomly selected Khmer font types on text recognition accuracy using Pytesseract. The fonts include Angkor, Battambang, Bayon, Bokor, Chenla, Dangrek, Freehand, Kh Kompong Chhnang, Kh SN Kampongsom, Khmer, Khmer CN Stueng Songke, Khmer Savuth Pen, Metal, Moul, Odor MeanChey, Preah Vihear, Siemreap, Sithi Manuss, and iSeth First. Our comparison of OCR performance across these fonts reveals that Khmer, Odor MeanChey, Siemreap, Sithi Manuss, and Battambang achieve high accuracy, while iSeth First, Bayon, and Dangrek perform poorly. This study underscores the critical importance of font selection in optimizing Khmer text recognition and provides valuable insights for developing more robust OCR systems.
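A hedged sketch of the kind of per-font evaluation loop this study implies: a reference string is rendered in each candidate font with Pillow, read back with Pytesseract's Khmer model, and scored with a simple character-level similarity; the font paths, reference text, and scoring metric here are assumptions, not the study's exact protocol.

```python
# Render a reference string in each candidate Khmer font, OCR it with Tesseract's
# Khmer model, and score the result; font paths and the metric are placeholders.
import difflib

import pytesseract
from PIL import Image, ImageDraw, ImageFont

REFERENCE_TEXT = "ភាសាខ្មែរ"  # ground-truth string rendered into every test image
FONTS = {
    "Battambang": "fonts/Battambang-Regular.ttf",
    "Siemreap": "fonts/Siemreap-Regular.ttf",
}

def render(text: str, font_path: str) -> Image.Image:
    font = ImageFont.truetype(font_path, 48)
    img = Image.new("RGB", (800, 120), "white")
    ImageDraw.Draw(img).text((10, 20), text, font=font, fill="black")
    return img

for name, path in FONTS.items():
    recognized = pytesseract.image_to_string(render(REFERENCE_TEXT, path), lang="khm")
    score = difflib.SequenceMatcher(None, REFERENCE_TEXT, recognized.strip()).ratio()
    print(f"{name}: character-level similarity = {score:.2f}")
```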
https://arxiv.org/abs/2506.23963
Exploring high-latitude lunar regions presents an extremely challenging visual environment for robots. The low sunlight elevation angle and minimal light scattering result in a visual field dominated by a high dynamic range featuring long, dynamic shadows. Reproducing these conditions on Earth requires sophisticated simulators and specialized facilities. We introduce a unique dataset recorded at the LunaLab of the SnT, University of Luxembourg, an indoor test facility designed to replicate the optical characteristics of multiple lunar latitudes. Our dataset includes images, inertial measurements, and wheel odometry data from robots navigating seven distinct trajectories under multiple illumination scenarios, simulating high-latitude lunar conditions from dawn to nighttime with and without the aid of headlights, resulting in 88 distinct sequences containing a total of 1.3M images. Data was captured using a stereo RGB-inertial sensor, a monocular monochrome camera, and, for the first time, a novel single-photon avalanche diode (SPAD) camera. We recorded both static and dynamic image sequences, with robots navigating at slow (5 cm/s) and fast (50 cm/s) speeds. All data is calibrated, synchronized, and timestamped, providing a valuable resource for validating perception tasks, from vision-based autonomous navigation to scientific imaging, for future lunar missions targeting high-latitude regions or for robots operating in perceptually degraded environments. The dataset can be downloaded from this https URL, and a visual overview is available at this https URL. All supplementary material can be found at this https URL.
https://arxiv.org/abs/2506.22956
Amid a tidal wave of misinformation flooding social media during elections and crises, extensive research has been conducted on misinformation detection, primarily focusing on text-based or image-based approaches. However, only a few studies have explored multimodal feature combinations, such as integrating text and images to build a classification model for detecting misinformation. This study investigates the effectiveness of different multimodal feature combinations, incorporating text, images, and social features into the classification model through an early fusion approach. The study analyzed 1,529 tweets containing both text and images, collected from Twitter (now X) during the COVID-19 pandemic and election periods. A data enrichment process was applied to extract additional social features, as well as visual features, through techniques such as object detection and optical character recognition (OCR). The results show that combining unsupervised and supervised machine learning models improves classification performance by 15% compared to unimodal models and by 5% compared to bimodal models. Additionally, the study analyzes the propagation patterns of misinformation based on the characteristics of misinformation tweets and the users who disseminate them.
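A minimal sketch of the early-fusion setup described above, in which text, image, and social feature vectors are concatenated before a single classifier; the feature extractors, dimensions, and classifier choice are placeholders rather than the study's pipeline.

```python
# Early fusion: text, image, and social feature vectors are concatenated into one
# vector per tweet before a single classifier. Features here are random stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_tweets = 200
text_feats = rng.normal(size=(n_tweets, 64))    # e.g. text embeddings
image_feats = rng.normal(size=(n_tweets, 32))   # e.g. OCR / object-detection features
social_feats = rng.normal(size=(n_tweets, 8))   # e.g. follower count, retweet count
labels = rng.integers(0, 2, size=n_tweets)      # 1 = misinformation, 0 = not

fused = np.hstack([text_feats, image_feats, social_feats])  # early fusion
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print("training accuracy:", clf.score(fused, labels))
```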
https://arxiv.org/abs/2507.01984
In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.
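A minimal PyTorch sketch of a CRNN of the kind described, with convolutional feature extraction followed by a recurrent sequence model emitting per-timestep character scores; layer sizes and input resolution are illustrative, not the released architecture.

```python
# Tiny CRNN: CNN feature extractor -> bidirectional LSTM -> per-column class scores
# (CTC-style). Dimensions are illustrative only.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(64 * 8, 128, bidirectional=True, batch_first=True)
        self.head = nn.Linear(256, num_classes)  # per-timestep class scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.cnn(x)                                  # (B, 64, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one feature vector per column
        out, _ = self.rnn(f)
        return self.head(out)                            # (B, W, num_classes)

logits = TinyCRNN(num_classes=200)(torch.randn(2, 1, 32, 128))
print(logits.shape)  # torch.Size([2, 32, 200])
```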
https://arxiv.org/abs/2506.21474
We developed a proof-of-concept method for the automatic analysis of the structure and content of incunabula pages. A custom dataset comprising 500 annotated pages from five different incunabula was created using resources from the Jagiellonian Digital Library. Each page was manually labeled with five predefined classes: Text, Title, Picture, Table, and Handwriting. Additionally, the publicly available DocLayNet dataset was utilized as supplementary training data. To perform object detection, YOLO11n and YOLO11s models were employed and trained using two strategies: a combined dataset (DocLayNet and the custom dataset) and the custom dataset alone. The highest performance (F1 = 0.94) was achieved by the YOLO11n model trained exclusively on the custom data. Optical character recognition was then conducted on regions classified as Text, using both Tesseract and Kraken OCR, with Tesseract demonstrating superior results. Subsequently, image classification was applied to the Picture class using a ResNet18 model, achieving an accuracy of 98.7% across five subclasses: Decorative_letter, Illustration, Other, Stamp, and Wrong_detection. Furthermore, the CLIP model was utilized to generate semantic descriptions of illustrations. The results confirm the potential of machine learning in the analysis of early printed books, while emphasizing the need for further advancements in OCR performance and visual content interpretation.
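A hedged sketch of the detection-then-OCR stage of such a pipeline: a fine-tuned YOLO11n layout model finds Text regions, which are cropped and passed to Tesseract; the weights filename, the class index assigned to Text, and the "lat" language pack are assumptions.

```python
# Run a fine-tuned YOLO11n layout model on a page image, crop regions predicted as
# Text, and pass them to Tesseract. Weights name, class index, and "lat" language
# pack are assumptions.
import pytesseract
from PIL import Image
from ultralytics import YOLO

TEXT_CLASS_ID = 0  # assumed index of the "Text" class in the custom dataset

model = YOLO("incunabula_yolo11n.pt")   # hypothetical fine-tuned weights file
page = Image.open("incunabula_page.jpg")
result = model(page)[0]

for box in result.boxes:
    if int(box.cls) != TEXT_CLASS_ID:
        continue
    x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
    region = page.crop((x1, y1, x2, y2))
    print(pytesseract.image_to_string(region, lang="lat"))  # most incunabula are Latin
```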
https://arxiv.org/abs/2506.18069
We present an enhanced YOLOv8 real-time vehicle detection and classification framework for estimating carbon emissions in urban environments. The system enhances the YOLOv8 architecture to detect, segment, and track vehicles from live traffic video streams. Once a vehicle is localized, a dedicated deep learning-based identification module is employed to recognize license plates and classify vehicle types. Since YOLOv8 lacks the built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes beyond class labels, our framework incorporates a hybrid pipeline in which each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding under varied conditions such as motion blur, occlusion, and diverse font styles. Additionally, the recognized plate information is validated using a real-time API that cross-references an external vehicle registration database to ensure accurate classification and emission estimation. This multi-stage approach enables precise, automated calculation of per-vehicle carbon emissions. Extensive evaluation was conducted using a diverse vehicle dataset enriched with segmentation masks and annotated license plates. The YOLOv8 detector achieved a mean Average Precision (mAP@0.5) of approximately 71% for bounding boxes and 70% for segmentation masks. Character-level OCR accuracy reached up to 99% with the best-performing CNN model. These results affirm the feasibility of combining real-time object detection with deep OCR for practical deployment in smart transportation systems, offering a scalable solution for automated, vehicle-specific carbon emission monitoring.
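As a toy illustration of the final emission-estimation step, one plausible formulation applies a class-specific emission factor to the distance over which each vehicle is tracked; the factor values and the function below are illustrative assumptions, not the paper's calibrated model.

```python
# Toy per-vehicle emission estimate: a class-specific factor times the tracked
# distance. Factor values are illustrative placeholders, not calibrated figures.
EMISSION_FACTORS_G_PER_KM = {"car": 120.0, "van": 180.0, "bus": 650.0, "truck": 800.0}

def estimate_emissions_grams(vehicle_class: str, tracked_distance_km: float) -> float:
    factor = EMISSION_FACTORS_G_PER_KM.get(vehicle_class, 150.0)  # fallback factor
    return factor * tracked_distance_km

print(estimate_emissions_grams("truck", 0.4))  # 320.0 g CO2 over a 400 m track
```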
https://arxiv.org/abs/2506.18924
Knowledge extraction through sound is a distinctive property. Visually impaired individuals often rely solely on Braille books and audio recordings provided by NGOs. Due to limitations in these approaches, blind individuals often cannot access books of their choice. Speech is a more effective mode of communication than text for blind and visually impaired persons, as they can easily respond to sounds. This paper presents the development of an accurate, reliable, cost-effective, and user-friendly optical character recognition (OCR)-based speech synthesis system. The OCR-based system has been implemented using Laboratory Virtual Instrument Engineering Workbench (LabVIEW).
https://arxiv.org/abs/2506.15029
Tariff exemptions are fundamental to attracting Foreign Direct Investment (FDI) into the manufacturing sector, though the associated administrative processes present areas for optimization for both investing entities and the national tax authority. This paper proposes a conceptual framework to empower tax administration by leveraging a synergistic integration of Optical Character Recognition (OCR) and Large Language Model (LLM) technologies. The proposed system is designed to first utilize OCR for intelligent digitization, precisely extracting data from diverse application documents and key regulatory texts such as tariff orders. Subsequently, the LLM would enhance the capabilities of administrative officers by automating the critical and time-intensive task of verifying submitted HS Tariff Codes for machinery, equipment, and raw materials against official exemption lists. By enhancing the speed and precision of these initial assessments, this AI-driven approach systematically reduces potential for non-alignment and non-optimized exemption utilization, thereby streamlining the investment journey for FDI companies. For the national administration, the benefits include a significant boost in operational capacity, reduced administrative load, and a strengthened control environment, ultimately improving the ease of doing business and solidifying the nation's appeal as a premier destination for high-value manufacturing FDI.
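A minimal sketch of the verification step envisaged here: HS codes extracted from OCR'd application text are checked against an exemption list; the regex, code format, and sample codes are assumptions for illustration only.

```python
# Check HS codes found in OCR'd application text against an exemption list.
# The regex, code format, and sample codes are illustrative assumptions.
import re

EXEMPT_HS_CODES = {"8458.11", "8479.89", "3907.61"}  # stand-in for the gazetted list

def verify_hs_codes(ocr_text: str) -> dict:
    found = set(re.findall(r"\b\d{4}\.\d{2}\b", ocr_text))
    return {code: code in EXEMPT_HS_CODES for code in found}

application = "Machinery under HS 8458.11 and raw material under HS 3901.10."
print(verify_hs_codes(application))  # {'8458.11': True, '3901.10': False}
```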
https://arxiv.org/abs/2506.12093
Retrieving accurate details from documents is a crucial task, especially when handling a combination of scanned images and native digital formats. This paper presents a combined framework for text extraction that merges Optical Character Recognition (OCR) techniques with Large Language Models (LLMs) to deliver structured outputs enriched by contextual understanding and confidence indicators. Scanned files are processed using OCR engines, while digital files are interpreted through layout-aware libraries. The extracted raw text is subsequently analyzed by an LLM to identify key-value pairs and resolve ambiguities. A comparative analysis of different OCR tools is presented to evaluate their effectiveness in terms of accuracy, layout recognition, and processing speed. The approach demonstrates significant improvements over traditional rule-based and template-based methods, offering enhanced flexibility and semantic precision across different document categories.
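A hedged sketch of the routing described above: scanned files go through an OCR engine, native digital PDFs through a layout-aware extractor (pdfplumber is used here as a stand-in), and the raw text is then placed into an LLM prompt for key-value extraction, with the LLM call itself left abstract.

```python
# Route documents by type: OCR for scanned images, a layout-aware extractor for
# native PDFs; the raw text then goes into an LLM prompt (the LLM call is omitted).
import pdfplumber
import pytesseract
from PIL import Image

def extract_raw_text(path: str, is_scanned: bool) -> str:
    if is_scanned:                      # path points to a page image (e.g. PNG/TIFF)
        return pytesseract.image_to_string(Image.open(path))
    with pdfplumber.open(path) as pdf:  # path points to a native digital PDF
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

raw = extract_raw_text("invoice.pdf", is_scanned=False)
prompt = (
    "Extract key-value pairs (invoice number, date, total amount) from the text "
    "below and flag any ambiguous field:\n" + raw
)
# `prompt` would be sent to the chosen LLM here.
```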
https://arxiv.org/abs/2506.11156
Single-image super-resolution refers to the reconstruction of a high-resolution image from a single low-resolution observation. Although recent deep learning-based methods have demonstrated notable success on simulated datasets -- with low-resolution images obtained by degrading and downsampling high-resolution ones -- they frequently fail to generalize to real-world settings, such as document scans, which are affected by complex degradations and semantic variability. In this study, we introduce a task-driven, multi-task learning framework for training a super-resolution network specifically optimized for optical character recognition tasks. We propose to incorporate auxiliary loss functions derived from high-level vision tasks, including text detection using the connectionist text proposal network, text recognition via a convolutional recurrent neural network, keypoints localization using this http URL, and hue consistency. To balance these diverse objectives, we employ a dynamic weight averaging mechanism, which adaptively adjusts the relative importance of each loss term based on its convergence behavior. We validate our approach on the SRResNet architecture, which is a well-established technique for single-image super-resolution. Experimental evaluations on both simulated and real-world scanned document datasets demonstrate that the proposed approach improves text detection, measured with intersection over union, while preserving overall image fidelity. These findings underscore the value of multi-objective optimization in super-resolution models for bridging the gap between simulated training regimes and practical deployment in real-world scenarios.
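The dynamic weight averaging mechanism, in its commonly used form, raises a task's weight when its loss has recently been decreasing more slowly; the sketch below assumes that standard formulation and a temperature of 2, which may differ from the paper's exact variant.

```python
# Dynamic weight averaging in its standard form: w_k = K * exp(r_k / T) / sum_i exp(r_i / T),
# where r_k = L_k(t-1) / L_k(t-2) is each task's recent loss ratio and T is a temperature.
import numpy as np

def dwa_weights(losses_t1: np.ndarray, losses_t2: np.ndarray, T: float = 2.0) -> np.ndarray:
    """losses_t1 = L_k(t-1), losses_t2 = L_k(t-2); returns K weights summing to K."""
    ratios = losses_t1 / losses_t2          # slower-descending tasks get larger ratios
    exp_terms = np.exp(ratios / T)
    return len(losses_t1) * exp_terms / exp_terms.sum()

# four auxiliary objectives: text detection, text recognition, keypoints, hue consistency
print(dwa_weights(np.array([0.8, 0.5, 0.9, 0.4]), np.array([1.0, 0.9, 1.0, 0.5])))
```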
https://arxiv.org/abs/2506.06953
Current smart glasses equipped with RGB cameras struggle to perceive the environment in low-light and high-speed motion scenarios due to motion blur and the limited dynamic range of frame cameras. Additionally, capturing dense images with a frame camera requires large bandwidth and power consumption, consequently draining the battery faster. These challenges are especially relevant for developing algorithms that can read text from images. In this work, we propose a novel event-based Optical Character Recognition (OCR) approach for smart glasses. By using the eye gaze of the user, we foveate the event stream to significantly reduce bandwidth by around 98% while exploiting the benefits of event cameras in high-dynamic and fast scenes. Our proposed method performs deep binary reconstruction trained on synthetic data and leverages multimodal LLMs for OCR, outperforming traditional OCR solutions. Our results demonstrate the ability to read text in low light environments where RGB cameras struggle while using up to 2400 times less bandwidth than a wearable RGB camera.
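A toy sketch of gaze-driven foveation on an event stream: only events falling inside a window centered on the current gaze estimate are kept, which is where the large bandwidth reduction comes from; the sensor resolution, window size, and simulated events are assumptions.

```python
# Keep only events inside a gaze-centered window; the discarded events are the
# bandwidth saving. Sensor size, window radius, and events are simulated stand-ins.
import numpy as np

W, H, WINDOW = 1280, 720, 128                            # assumed sensor size and crop radius
rng = np.random.default_rng(0)
events = rng.integers(0, [W, H], size=(1_000_000, 2))    # simulated (x, y) event coordinates
gaze = np.array([640, 360])                              # current gaze estimate in pixels

inside = np.all(np.abs(events - gaze) <= WINDOW, axis=1)
kept = events[inside]
print(f"kept {kept.shape[0]} of {events.shape[0]} events "
      f"({100 * (1 - kept.shape[0] / events.shape[0]):.1f}% reduction)")
```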
https://arxiv.org/abs/2506.06918
Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show "Who Wants to Be a Millionaire?" (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available at Hugging Face.
https://arxiv.org/abs/2506.05991
Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes not only in the field of OCR but possibly also in the broader domain of pattern recognition. The dataset is available at this https URL.
https://arxiv.org/abs/2506.04807
This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they use a straightforward image-label compositor as the prompt and query input, and then mask the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. Accordingly, this is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art across both in- and out-of-domain benchmarks. The code is available at this https URL.
https://arxiv.org/abs/2506.03799
Visually impaired individuals face significant challenges navigating and interacting with unknown situations, particularly in tasks requiring spatial awareness and semantic scene understanding. To accelerate the development and evaluate the state of technologies that enable visually impaired people to solve these tasks, the Vision Assistance Race (VIS) at the Cybathlon 2024 competition was organized. In this work, we present Sight Guide, a wearable assistive system designed for the VIS. The system processes data from multiple RGB and depth cameras on an embedded computer that guides the user through complex, real-world-inspired tasks using vibration signals and audio commands. Our software architecture integrates classical robotics algorithms with learning-based approaches to enable capabilities such as obstacle avoidance, object detection, optical character recognition, and touchscreen interaction. In a testing environment, Sight Guide achieved a 95.7% task success rate, and further demonstrated its effectiveness during the Cybathlon competition. This work provides detailed insights into the system design, evaluation results, and lessons learned, and outlines directions towards a broader real-world applicability.
https://arxiv.org/abs/2506.02676
The inherent complexities of Arabic script, namely its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and a BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
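For reference, the reported metrics are typically computed with standard open-source scorers such as jiwer (WER/CER) and sacrebleu (BLEU), as sketched below on a tiny diacritized example; the paper's actual evaluation scripts and text normalization are not specified here.

```python
# Score a hypothesis against a diacritized reference with common open-source tools.
import jiwer
import sacrebleu

reference = "السَّلَامُ عَلَيْكُمْ"
hypothesis = "السَّلَام عَلَيْكُمْ"   # one diacritic dropped

print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
print("BLEU:", sacrebleu.corpus_bleu([hypothesis], [[reference]]).score)
```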
https://arxiv.org/abs/2506.02295
Kwak'wala is an Indigenous language spoken in British Columbia, with a rich legacy of published documentation spanning more than a century, and an active community of speakers, teachers, and learners engaged in language revitalization. Over 11 volumes of the earliest texts created during the collaboration between Franz Boas and George Hunt have been scanned but remain unreadable by machines. Complete digitization through optical character recognition has the potential to facilitate transliteration into modern orthographies and the creation of other language technologies. In this paper, we apply the latest OCR techniques to a series of Kwak'wala texts only accessible as images, and discuss the challenges and unique adaptations necessary to make such technologies work for these real-world texts. Building on previous methods, we propose using a mix of off-the-shelf OCR methods, language identification, and masking to effectively isolate Kwak'wala text, along with post-correction models, to produce a final high-quality transcription.
https://arxiv.org/abs/2506.01775
Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats. However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts. Existing Arabic OCR datasets often focus on isolated words or lines or are limited in scale, typographic variety, or structural complexity found in books. To address this significant gap, we introduce SARD (Large-Scale Synthetic Arabic OCR Dataset). SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training. Its synthetic nature provides unparalleled scalability and allows for precise control over layout and content variation. We detail the dataset's composition and generation process and provide benchmark results for several OCR models, including traditional and deep learning approaches, highlighting the challenges and opportunities presented by this dataset. SARD serves as a valuable resource for developing and evaluating robust OCR and vision-language models capable of processing diverse Arabic book-style texts.
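A hedged sketch of the kind of synthetic rendering such a dataset relies on: Arabic text is contextually shaped and reordered for right-to-left display, then rasterized in a chosen font while the original string is kept as the ground-truth label; the font path, page size, and sample text are placeholders, not the SARD generation pipeline.

```python
# Shape Arabic text for right-to-left rendering, rasterize it in a chosen font, and
# keep the original string as the label. Font path, page size, and text are placeholders.
import arabic_reshaper
from bidi.algorithm import get_display
from PIL import Image, ImageDraw, ImageFont

def render_sample(text: str, font_path: str, out_path: str) -> str:
    shaped = get_display(arabic_reshaper.reshape(text))  # contextual forms + RTL order
    font = ImageFont.truetype(font_path, 36)
    page = Image.new("RGB", (1200, 100), "white")
    ImageDraw.Draw(page).text((20, 25), shaped, font=font, fill="black")
    page.save(out_path)
    return text  # the unshaped string serves as the ground-truth label

label = render_sample("مرحبا بالعالم", "fonts/Amiri-Regular.ttf", "sample_0001.png")
print(label)
```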
https://arxiv.org/abs/2505.24600
Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layout and variability in handwriting styles. Prior methods have faced performance bottlenecks, proposing isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, surpassing the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: this https URL
https://arxiv.org/abs/2505.23566