We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed this http URL, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate this http URL from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, this http URL achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at this https URL.
https://arxiv.org/abs/2603.13032
Integrating Large Language Models (LLMs) into business process management tools promises to democratize Business Process Model and Notation (BPMN) modeling for non-experts. While automated frameworks assess syntactic and semantic quality, they miss human factors like trust, usability, and professional alignment. We conducted a mixed-methods evaluation of our proposed solution, an LLM-powered BPMN copilot, with five process modeling experts using focus groups and standardized questionnaires. Our findings reveal a critical tension between acceptable perceived usability (mean CUQ score: 67.2/100) and notably lower trust (mean score: 48.8%), with reliability rated as the most critical concern (M=1.8/5). Furthermore, we identified output-quality issues, prompting difficulties, and a need for the LLM to ask more in-depth clarifying questions about the process. We envision five use cases ranging from domain-expert support to enterprise quality assurance. We demonstrate the necessity of human-centered evaluation complementing automated benchmarking for LLM modeling agents.
https://arxiv.org/abs/2603.12895
With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to effectively prune component VLMs that do not add value to fusion performance. We identify the best combination for each task, fuse the outputs of the VLMs in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach produces dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM in accuracy by 8.09% on MMMU and by 4.87% on MMMU-Pro. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at this https URL.
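The CKA-based diversity signal described above can be illustrated with a minimal linear-CKA computation; this is an illustrative sketch with our own helper name (`linear_cka`) and toy embeddings, not the paper's released code.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two feature matrices.

    x, y: (n_samples, n_features) embeddings from two models on the
    same inputs. Returns a similarity in [0, 1]; 1 - CKA can serve as
    a disagreement (diversity) signal between visual encoders.
    """
    # Centre each feature dimension.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalised by ||X^T X||_F * ||Y^T Y||_F
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
a = rng.normal(size=(100, 32))
b = rng.normal(size=(100, 32))
print(linear_cka(a, a))  # identical embeddings -> 1.0
print(linear_cka(a, b))  # lower for unrelated embeddings
```

A pairwise matrix of such scores over a VLM pool is one plausible input for the pruning search the abstract describes.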
https://arxiv.org/abs/2603.12669
Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressively high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols rely heavily on downstream task performance. Such metrics fail to accurately measure text preservation because of the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs' capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation among testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results purely reflect VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.
https://arxiv.org/abs/2603.11846
We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves 72% reduction in time-to-accurate-answer, 85% reduction in conversational rework, and 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.
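The supervisor-style routing described above can be gestured at with a toy dispatcher; the tool stubs, class names, and synthesis logic here are hypothetical placeholders, far simpler than the actual framework's RouteLLM and SLM-assisted paths.

```python
from dataclasses import dataclass, field

@dataclass
class Query:
    text: str
    modalities: set = field(default_factory=set)  # e.g. {"image", "audio"}

# Stub tools standing in for OCR, object detection, transcription, etc.
TOOLS = {
    "text": lambda q: f"llm-answer({q.text})",
    "image": lambda q: "ocr+detection-results",
    "audio": lambda q: "speech-transcript",
}

def supervise(query: Query) -> str:
    """Toy central supervisor: route text-only queries down a single
    LLM path, decompose multimodal queries across per-modality tools,
    then synthesise the partial results into one answer."""
    if not query.modalities:                 # text-only fast path
        return TOOLS["text"](query)
    parts = [TOOLS[m](query) for m in sorted(query.modalities)]
    return " | ".join(parts)                 # naive synthesis step

print(supervise(Query("summarise this")))
print(supervise(Query("what is shown?", {"image", "audio"})))
```

The point of the sketch is the adaptive branch: routing is decided per query rather than by a fixed decision tree.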
https://arxiv.org/abs/2603.11545
Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant obstacle. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
https://arxiv.org/abs/2603.11421
In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents, transforming ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and the need for cell-level referencing, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning training strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments demonstrate that our model achieves strong performance on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that includes six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.
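The C-IoU metric above reduces to ordinary bounding-box IoU applied per table cell; a minimal sketch with our own helper name, not the benchmark's evaluation code:

```python
def cell_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned cell boxes,
    each given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

print(cell_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ~ 0.1429
```

Averaging this score over matched predicted/ground-truth cell pairs yields a cell-level localization measure of the kind the benchmark reports.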
https://arxiv.org/abs/2603.11044
Accurate and early detection of oral cancer lesions is crucial for effective diagnosis and treatment. This study evaluates two Robotic Process Automation (RPA) implementations, OC-RPAv1 and OC-RPAv2, using a test set of 31 images. OC-RPAv1 processes one image per prediction in an average of 0.29 seconds, while OC-RPAv2 employs a Singleton design pattern and batch processing, reducing prediction time to just 0.06 seconds per image. This represents a 60-100x efficiency improvement over standard RPA methods, showing that design patterns and batch processing can enhance scalability and reduce costs in oral cancer detection.
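The Singleton-plus-batching design credited with the speedup can be illustrated with a toy Python sketch; the class name and stub classifier are hypothetical stand-ins, not the paper's implementation.

```python
import time

class OralCancerDetector:
    """Toy Singleton: the (expensive) model is loaded exactly once
    and reused across all calls, instead of once per prediction."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._load_model()
        return cls._instance

    def _load_model(self):
        time.sleep(0.01)                      # stand-in for costly model loading
        self.model = lambda image: "benign"   # stub classifier

    def predict_batch(self, images):
        # One pass over a batch amortises per-call overhead.
        return [self.model(img) for img in images]

a, b = OralCancerDetector(), OralCancerDetector()
print(a is b)  # True: the model was loaded only once
print(a.predict_batch(["img1", "img2", "img3"]))
```

The second construction returns the cached instance, so model loading is paid once rather than on every prediction, which is the mechanism behind the per-image time reduction described above.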
https://arxiv.org/abs/2603.10928
GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
https://arxiv.org/abs/2603.10910
SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering a period from 1800 CE to 1955 CE in terms of publication dates, and a historical span from the 5th to the 20th century CE in terms of written dates. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list, which was digitised using Google Document AI OCR. This was followed by post-processing to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. This was particularly relevant for syntactic annotation and text normalisation strategies, given the low-resource status shared by Faroese and Sinhala and the similar cleaning strategies utilised in CCOHA. This corpus is categorised into two layers based on genres: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite facing challenges due to limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.
https://arxiv.org/abs/2603.10861
We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE's performance against physicians with varying levels of expertise, from residents to senior specialists, and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.
https://arxiv.org/abs/2603.10492
An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. The text recognition problem is framed as a sequence generation task with a VisionEncoderDecoder architecture, and several encoder-decoder combinations are evaluated. Among the evaluated combinations, the ViT + BanglaBERT model gives the best results at the character level, with a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068. The proposed system also shows consistent performance when tested on an external dataset curated for the purposes of this study. The dataset offers completely different environmental and lighting conditions from the training sample, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
https://arxiv.org/abs/2603.10267
Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models more than 3x its scale, such as BAGEL (14B), on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
https://arxiv.org/abs/2603.09877
Large-scale scientific collaborations, such as the Compact Muon Solenoid (CMS) at CERN, produce a vast and ever-growing corpus of internal documentation. Navigating this complex information landscape presents a significant challenge for both new and experienced researchers, hindering knowledge sharing and slowing down the pace of scientific discovery. To address this, we present a prototype of MITRA, a Retrieval-Augmented Generation (RAG) based system, designed to answer specific, context-aware questions about physics analyses. MITRA employs a novel, automated pipeline using Selenium for document retrieval from internal databases and Optical Character Recognition (OCR) with layout parsing for high-fidelity text extraction. Crucially, MITRA's entire framework, from the embedding model to the Large Language Model (LLM), is hosted on-premise, ensuring that sensitive collaboration data remains private. We introduce a two-tiered vector database architecture that first identifies the relevant analysis from abstracts before focusing on the full documentation, resolving potential ambiguities between different analyses. We demonstrate the prototype's superior retrieval performance against a standard keyword-based baseline on realistic queries and discuss future work towards developing a comprehensive research agent for large experimental collaborations.
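The two-tiered retrieval idea above (match the query against analysis abstracts first, then search only the chosen analysis's full documentation) can be sketched with a toy bag-of-words retriever; the embedding, data, and function names are stand-ins, not MITRA's actual on-premise stack.

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a
    neural embedding model hosted on-premise."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Tier 1 index: one abstract per analysis (hypothetical examples).
abstracts = {
    "higgs-width": "measurement of the higgs boson width",
    "top-mass": "top quark mass measurement with jets",
}
# Tier 2 index: full-documentation chunks, keyed by analysis.
full_docs = {
    "higgs-width": ["intro: width extraction method",
                    "systematics: jet energy scale"],
    "top-mass": ["intro: top mass fits", "systematics: b-tagging"],
}

def answer_context(query):
    """Tier 1: pick the analysis whose abstract best matches the query,
    resolving ambiguity between analyses up front.
    Tier 2: retrieve only within that analysis's documentation."""
    q = embed(query)
    best = max(abstracts, key=lambda k: cosine(q, embed(abstracts[k])))
    chunks = full_docs[best]
    return best, max(chunks, key=lambda c: cosine(q, embed(c)))

print(answer_context("how was the higgs width extraction done"))
```

Restricting tier-2 search to a single analysis is what prevents chunks from unrelated analyses with similar vocabulary from contaminating the context handed to the LLM.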
https://arxiv.org/abs/2603.09800
Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models and the benchmark are released at this https URL.
https://arxiv.org/abs/2603.09677
We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, largely outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
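The reported CER and WER are standard edit-distance metrics: Levenshtein distance between hypothesis and reference, normalised by reference length, computed over characters for CER and over whitespace-split words for WER. A minimal reference implementation (not the project's evaluation code):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character error rate: edits per reference character."""
    return levenshtein(hyp, ref) / len(ref)

def wer(hyp, ref):
    """Word error rate: edits per reference word."""
    return levenshtein(hyp.split(), ref.split()) / len(ref.split())

print(cer("kitten", "sitting"))            # 3 edits / 7 chars ~ 0.4286
print(wer("the cat sat", "the cat sat down"))  # 1 / 4 = 0.25
```

Note that both rates can exceed 1.0 when the hypothesis is much longer than the reference, which matters when comparing systems on badly degraded pages.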
https://arxiv.org/abs/2603.09470
Document Image Machine Translation (DIMT) seeks to translate text embedded in document images from one language to another by jointly modeling both textual content and page layout, bridging optical character recognition (OCR) and natural language processing (NLP). The DIMT 2025 Challenge advances research on end-to-end document image translation, a rapidly evolving area within multimodal document understanding. The competition features two tracks, OCR-free and OCR-based, each with two subtasks for small (less than 1B parameters) and large (greater than 1B parameters) models. Participants submit a single unified DIMT system, with the option to incorporate provided OCR transcripts. Running from December 10, 2024 to April 20, 2025, the competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions, while Track 2 had 35 teams and 14 valid submissions. In this report, we present the challenge motivation, dataset construction, task definitions, evaluation protocol, and a summary of results. Our analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images and highlight substantial opportunities for future research.
https://arxiv.org/abs/2603.09392
Dataset condensation (DC) learns a compact synthetic dataset that enables models to match the performance of full-data training, prioritising utility over distributional fidelity. While typically explored for computational efficiency, DC also holds promise for healthcare data democratisation, especially when paired with differential privacy, allowing synthetic data to serve as a safe alternative to real records. However, existing DC methods rely on differentiable neural networks, limiting their compatibility with widely used clinical models such as decision trees and Cox regression. We address this gap using a differentially private, zero-order optimisation framework that extends DC to non-differentiable models using only function evaluations. Empirical results across six datasets, including both classification and survival tasks, show that the proposed method produces condensed datasets that preserve model utility while providing effective differential privacy guarantees, enabling model-agnostic data sharing for clinical prediction tasks without exposing sensitive patient information.
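The core zero-order trick, estimating a gradient from loss evaluations alone, can be sketched with a two-point SPSA-style estimator on a toy problem. Everything here is illustrative: the quadratic "black-box" loss stands in for a non-differentiable clinical model's validation error, and the added Gaussian noise merely gestures at DP noising rather than being a calibrated differential-privacy mechanism.

```python
import numpy as np

def zo_gradient(loss, x, mu=1e-3, n_dirs=20, rng=None):
    """Two-point zero-order gradient estimate. Only loss *evaluations*
    are needed, so `loss` may wrap a non-differentiable model such as
    a decision tree or a Cox regression fit."""
    rng = rng if rng is not None else np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.shape)               # random probe direction
        g += (loss(x + mu * u) - loss(x - mu * u)) / (2 * mu) * u
    return g / n_dirs

# Toy "condensation": optimise a synthetic point against a black-box loss.
target = np.array([3.0, -1.0])
loss = lambda x: float(np.sum((x - target) ** 2))  # values only, no autograd

x = np.zeros(2)
rng = np.random.default_rng(1)
for _ in range(200):
    g = zo_gradient(loss, x, rng=rng)
    g += rng.normal(scale=0.05, size=g.shape)  # illustrative DP-style noise
    x -= 0.05 * g
print(round(loss(x), 3))
```

Because only function values are queried, the same loop works unchanged whether the inner model is a neural network, a tree ensemble, or a survival model.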
https://arxiv.org/abs/2603.09356
Document understanding with multimodal large language models (MLLMs) requires not only accurate answers but also explicit, evidence-grounded reasoning, especially in high-stakes scenarios. However, current document MLLMs still fall short of forming a complete, human-like reasoning process: even when they improve both layout encoding and CoT-style prompting, the interaction between the two is typically learned implicitly and remains loosely coupled rather than being enforced as a systematic mechanism. We therefore propose DocCogito, a unified framework that integrates global layout perception with structured, region-grounded reasoning. DocCogito introduces a lightweight layout tower that distills page structure into learnable global layout prior tokens, and a deterministic Visual-Semantic Chain (VSC), a concise structured representation less ambiguous than free-form natural-language CoT, to supervise fine-grained intermediate reasoning aligned with evidence regions. Training follows a progressive recipe, including layout perception pretraining, VSC-guided cold start, rejection sampling, and GRPO. To further strengthen the internal coupling between layout priors and VSC execution, we augment standard rewards with a fine-grained region-confidence signal that encourages reasoning traces to stay aligned with corresponding evidence regions. Extensive experiments on six benchmarks (DocVQA, WTQ, ChartQA, TextVQA, OCRBench, and InfoVQA) demonstrate strong generalization, achieving state-of-the-art results on four of them.
https://arxiv.org/abs/2603.07494
Deliberative democratic theory suggests that civic competence (the capacity to navigate disagreement, weigh competing values, and arrive at collective decisions) is not innate but developed through practice. Yet opportunities to cultivate these skills remain limited, as traditional deliberative processes like citizens' assemblies reach only a small fraction of the population. We present Agora, an early-stage AI-powered platform that uses LLMs to organize authentic human voices on policy issues, helping users build consensus-finding skills by proposing and revising policy recommendations, hearing supporting and opposing perspectives, and receiving feedback on how policy changes affect predicted support. In a preliminary study with 44 university students, participants using the full interface (with access to voice explanations) reported higher levels of problem-solving skills, internal deliberation, and produced higher quality consensus statements compared to a control condition showing only aggregate support distributions. These initial findings point toward a promising direction for scaling civic education.
https://arxiv.org/abs/2603.07339