Many document types use intrinsic, convention-driven structures to encode precise, structured information; the conventions governing engineering drawings are one example. However, state-of-the-art approaches treat document recognition as a mere computer vision problem and neglect these underlying, document-type-specific structural properties. This leaves them dependent on sub-optimal heuristic post-processing and renders many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific inductive biases for the underlying machine-learned end-to-end document recognition systems, along with a base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the resulting inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first successful end-to-end model that transcribes engineering drawings into their inherently interlinked information. Our approach can inform the design of document recognition systems for document types that are less well understood than standard OCR or OMR, and it serves as a guide toward unifying the design of future document foundation models.
https://arxiv.org/abs/2507.08458
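The paper's central object, a graph-structured record that an end-to-end model emits token by token, can be illustrated with a small sketch. Everything below (entity names, attribute keys, the linearization scheme) is a hypothetical example of such a record, not the authors' actual schema.

```python
# Hypothetical sketch: a graph-structured "record" for a drawing and a simple
# linearization an autoregressive transcription model could be trained to emit.
# Entity kinds, attribute keys, and tags are illustrative, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    kind: str                      # e.g. "view", "dimension", "note"
    attrs: dict = field(default_factory=dict)

@dataclass
class Record:
    nodes: list
    edges: list                    # (source_id, relation, target_id) triples

    def linearize(self) -> str:
        """Serialize the graph into a flat token sequence (one possible scheme)."""
        parts = []
        for n in self.nodes:
            attrs = " ".join(f"{k}={v}" for k, v in n.attrs.items())
            parts.append(f"<node {n.node_id} {n.kind} {attrs}>")
        for src, rel, dst in self.edges:
            parts.append(f"<edge {src} {rel} {dst}>")
        return " ".join(parts)

record = Record(
    nodes=[Node("n1", "view", {"name": "front"}),
           Node("n2", "dimension", {"value": "12.5", "unit": "mm"})],
    edges=[("n2", "measures", "n1")],
)
print(record.linearize())
```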
Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on DIMT datasets often result in the forgetting of the model's existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, Synchronously Self-Reviewing (SSR), in which the model reviews its own OCR proficiency while learning to translate, inspired by the concept of the "Bilingual Cognitive Advantage". Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate that the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.
https://arxiv.org/abs/2507.08309
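The core of SSR, as described in the abstract, is ordering the supervision so OCR text precedes the translation. A minimal sketch of how such a training target could be assembled is shown below; the tags, prompt wording, and field names are assumptions rather than the paper's exact format.

```python
# Minimal sketch of assembling an SSR-style supervision target: the model first
# reproduces the source-language OCR text, then the translation. The template
# and separator tags are assumptions, not the paper's exact format.
def build_ssr_target(ocr_text: str, translation: str) -> str:
    return f"<ocr>\n{ocr_text}\n</ocr>\n<translation>\n{translation}\n</translation>"

def build_example(image_path: str, ocr_text: str, translation: str) -> dict:
    return {
        "image": image_path,
        "prompt": "Read the document image, transcribe its text, then translate it.",
        "target": build_ssr_target(ocr_text, translation),
    }

example = build_example("invoice_de.png",
                        "Rechnung Nr. 1024 vom 03.05.2024",
                        "Invoice no. 1024 dated 2024-05-03")
print(example["target"])
```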
Out-of-context reasoning (OOCR) is a phenomenon in which fine-tuned LLMs exhibit surprisingly deep out-of-distribution generalization. Rather than learning shallow heuristics, they implicitly internalize and act on the consequences of observations scattered throughout the fine-tuning data. In this work, we investigate this phenomenon mechanistically and find that many instances of OOCR in the literature have a simple explanation: the LoRA fine-tuning essentially adds a constant steering vector, steering the model towards a general concept. This improves performance on the fine-tuning task and in many other concept-related domains, causing the surprising generalization. Moreover, we can directly train steering vectors for these tasks from scratch, which also induces OOCR. We find that our results hold even for a task that seems like it must involve conditional behavior (model backdoors); it turns out that unconditionally adding a steering vector is sufficient. Overall, our work presents one explanation of what gets learned during fine-tuning for OOCR tasks, contributing to the key question of why LLMs can reason out of context, an advanced capability that is highly relevant to their safe and reliable deployment.
https://arxiv.org/abs/2507.08218
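The claimed mechanism, an approximately constant vector added to the model's hidden states, is easy to illustrate. Below is a toy PyTorch sketch of a block whose output is shifted by a learned steering direction; the tiny stand-in block, the layer choice, and the scale are illustrative assumptions, not the paper's setup.

```python
# Toy illustration of a constant steering vector added to a model's hidden
# states, the mechanism the paper identifies behind many OOCR results.
import torch
import torch.nn as nn

class SteeredBlock(nn.Module):
    def __init__(self, block: nn.Module, hidden_dim: int, alpha: float = 4.0):
        super().__init__()
        self.block = block
        self.alpha = alpha
        # Trainable steering direction; in the paper's analysis this is roughly
        # what LoRA fine-tuning ends up contributing (nearly input-independent).
        self.v = nn.Parameter(torch.randn(hidden_dim) / hidden_dim**0.5)

    def forward(self, h):
        return self.block(h) + self.alpha * self.v   # unconditional shift

hidden_dim = 64
base_block = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                           nn.Linear(hidden_dim, hidden_dim))
steered = SteeredBlock(base_block, hidden_dim)

h = torch.randn(2, 10, hidden_dim)        # (batch, seq, hidden)
print(steered(h).shape)                   # torch.Size([2, 10, 64])
```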
Robots can better interact with humans and unstructured environments through touch sensing. However, most commercial robots are not equipped with tactile skins, making it challenging to achieve even basic touch-sensing functions, such as contact localization. We present UniTac, a data-driven whole-body touch-sensing approach that uses only proprioceptive joint sensors and does not require the installation of additional sensors. Our approach enables a robot equipped solely with joint sensors to localize contacts. Our goal is to democratize touch sensing and to give HRI researchers an off-the-shelf tool for equipping their robots with touch-sensing capabilities. We validate our approach on two platforms: the Franka robot arm and the Spot quadruped. On Franka, we can localize contact to within 8.0 centimeters, and on Spot, we can localize to within 7.2 centimeters at around 2,000 Hz on an RTX 3090 GPU, without adding any additional sensors to the robot. Project website: this https URL.
https://arxiv.org/abs/2507.07980
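The general recipe, learning a map from proprioceptive joint signals to a contact location, can be sketched with synthetic data. The sklearn regressor and fabricated torque-residual features below are stand-ins for illustration only; they are not UniTac's model or data.

```python
# Sketch of the general recipe behind proprioception-only contact localization:
# learn a mapping from joint-level signals (e.g. torque residuals) to a 3D
# contact point. Synthetic data and the sklearn MLP are stand-ins, not UniTac.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_joints = 5000, 7
X = rng.normal(size=(n, n_joints))                 # fake torque residuals
W = rng.normal(size=(n_joints, 3))
y = X @ W + 0.05 * rng.normal(size=(n, 3))         # fake contact xyz (meters)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=300, random_state=0)
model.fit(X_tr, y_tr)

err = np.linalg.norm(model.predict(X_te) - y_te, axis=1)
print(f"mean localization error: {100 * err.mean():.1f} cm")
```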
The accelerating growth of photographic collections has outpaced manual cataloguing, motivating the use of vision-language models (VLMs) to automate metadata generation. This study examines whether AI-generated catalogue descriptions can approximate human-written quality and how generative AI might integrate into cataloguing workflows in archival and museum collections. A VLM (InternVL2) generated catalogue descriptions for photographic prints on labelled cardboard mounts with archaeological content, evaluated by archive and archaeology experts and non-experts in a human-centered, experimental framework. Participants classified descriptions as AI-generated or expert-written, rated quality, and reported willingness to use and trust in AI tools. Classification performance was above chance level, with both groups underestimating their ability to detect AI-generated descriptions. OCR errors and hallucinations limited perceived quality, yet descriptions rated higher in accuracy and usefulness were harder to classify, suggesting that human review is necessary to ensure the accuracy and quality of catalogue descriptions generated by the out-of-the-box model, particularly in specialized domains like archaeological cataloguing. Experts showed lower willingness to adopt AI tools, emphasizing concerns about preservation responsibility over technical performance. These findings advocate for a collaborative approach in which AI supports draft generation but remains subordinate to human verification, ensuring alignment with curatorial values (e.g., provenance, transparency). The successful integration of this approach depends not only on technical advancements, such as domain-specific fine-tuning, but even more on establishing trust among professionals, both of which could be fostered through a transparent and explainable AI pipeline.
https://arxiv.org/abs/2507.07551
This position paper explores pluriperspectivism as a core element of human creative experience and its relevance to human-robot co-creativity. We propose a layered five-dimensional model to guide the design of co-creative behaviors and the analysis of interaction dynamics. This model is based on literature and on results from an interview study we conducted with 10 visual artists and 8 arts educators, examining how pluriperspectivism supports creative practice. The findings of this study provide insight into how robots could enhance human creativity through adaptive, context-sensitive behavior, demonstrating the potential of pluriperspectivism. This paper outlines future directions for integrating pluriperspectivism with vision-language models (VLMs) to support context sensitivity in co-creative robots.
https://arxiv.org/abs/2507.07550
This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extraction accuracy and consistency, supporting real-world use cases such as automated financial workflows and digital archiving.
https://arxiv.org/abs/2507.07029
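A minimal sketch of this kind of pipeline is shown below: run Tesseract via pytesseract, then group word boxes into rows by their vertical positions. The preprocessing and clustering thresholds are assumptions, not the authors' exact heuristics.

```python
# Minimal sketch of an OCR-to-table pipeline in the spirit of the paper:
# run Tesseract, then group word boxes into rows by geometry.
# The row tolerance and preprocessing are assumptions, not the authors' logic.
import pytesseract
from pytesseract import Output
from PIL import Image

def extract_rows(image_path: str, row_tol: int = 10):
    img = Image.open(image_path).convert("L")          # simple preprocessing
    data = pytesseract.image_to_data(img, output_type=Output.DICT)

    words = [
        {"text": data["text"][i], "left": data["left"][i], "top": data["top"][i]}
        for i in range(len(data["text"]))
        if data["text"][i].strip() and float(data["conf"][i]) > 0
    ]
    words.sort(key=lambda w: (w["top"], w["left"]))

    rows, current = [], []
    for w in words:
        if current and abs(w["top"] - current[-1]["top"]) > row_tol:
            rows.append(current)
            current = []
        current.append(w)
    if current:
        rows.append(current)
    # Each row is returned left-to-right; column assignment could follow by
    # clustering the "left" coordinates across rows.
    return [[w["text"] for w in sorted(r, key=lambda w: w["left"])] for r in rows]

# rows = extract_rows("invoice_scan.png")
```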
Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the speaker's reference image to synthesize high-fidelity videos. To democratize research, we present CSG-405, the first public dataset with 405 hours of high-resolution videos across 71 speech types, annotated with 2D skeletons and diverse speaker demographics. Experiments show that our method exceeds state-of-the-art approaches in visual quality and synchronization while generalizing across speakers and contexts.
https://arxiv.org/abs/2507.06812
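The skeleton-audio fusion step can be illustrated with a toy module that combines per-frame audio features with a reference pose to predict joint positions. The dimensions and the simple additive fusion below are assumptions; the paper uses a diffusion model rather than this direct regressor.

```python
# Toy sketch of skeleton-audio fusion: condition a motion predictor on per-frame
# audio features and a reference 2D skeleton. Dimensions and the MLP fusion are
# illustrative assumptions, not the paper's diffusion architecture.
import torch
import torch.nn as nn

class SkeletonAudioFusion(nn.Module):
    def __init__(self, audio_dim=128, n_joints=18, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.skel_proj = nn.Linear(n_joints * 2, hidden)    # reference pose (x, y)
        self.head = nn.Sequential(nn.GELU(), nn.Linear(hidden, n_joints * 2))

    def forward(self, audio_feats, ref_skeleton):
        # audio_feats: (batch, frames, audio_dim); ref_skeleton: (batch, n_joints*2)
        a = self.audio_proj(audio_feats)
        s = self.skel_proj(ref_skeleton).unsqueeze(1)       # broadcast over frames
        return self.head(a + s)                             # per-frame joint positions

model = SkeletonAudioFusion()
audio = torch.randn(2, 100, 128)
ref = torch.randn(2, 36)
print(model(audio, ref).shape)   # torch.Size([2, 100, 36])
```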
Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and 0.0024 character error rate on synthetic data, while crucially maintaining 93.1% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8% synthetic accuracy, it suffered severe degradation to 72.5% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a cost-effective solution deployable on accessible infrastructure. This work establishes a transferable framework for endangered language OCR that removes technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Code and model weights are available at this https URL.
https://arxiv.org/abs/2507.06761
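A sketch of the parameter-efficient fine-tuning recipe using the peft library is shown below. It assumes `model` is an already-loaded vision-language checkpoint (for example Qwen2.5-VL), and the rank, target modules, and other hyperparameters are illustrative rather than the paper's settings.

```python
# Sketch of parameter-efficient fine-tuning with LoRA via the `peft` library.
# `model` is assumed to be a loaded vision-language checkpoint; all values below
# are illustrative, not the paper's hyperparameters.
from peft import LoraConfig, get_peft_model

def wrap_with_lora(model):
    config = LoraConfig(
        r=16,                                   # low-rank adapter dimension
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    peft_model = get_peft_model(model, config)
    peft_model.print_trainable_parameters()     # typically well under 1% of weights
    return peft_model

# peft_model = wrap_with_lora(model)
# ...then train on (synthetic word image, transcription) pairs as usual.
```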
The role of civil society organizations (CSOs) in monitoring harmful online content is increasingly crucial, especially as platform providers reduce their investment in content moderation. AI tools can assist in detecting and monitoring harmful content at scale. However, few open-source tools offer seamless integration of AI models and social media monitoring infrastructures. Given their thematic expertise and contextual understanding of harmful content, CSOs should be active partners in co-developing technological tools, providing feedback, helping to improve models, and ensuring alignment with stakeholder needs and values, rather than passive 'consumers'. However, collaborations between the open source community, academia, and civil society remain rare, and research on harmful content seldom translates into practical tools usable by civil society actors. This work in progress explores how CSOs can be meaningfully involved in an AI-assisted open-source monitoring tool of anti-democratic movements on Telegram, which we are currently developing in collaboration with CSO stakeholders.
https://arxiv.org/abs/2507.06734
Recent studies have revealed a consistent liberal orientation in the ethical and political responses generated by most commercial large language models (LLMs), yet the underlying causes and resulting implications remain unclear. This paper systematically investigates the political temperament of seven prominent LLMs (OpenAI's GPT-4o, Anthropic's Claude Sonnet 4, Perplexity's Sonar Large, Google's Gemini 2.5 Flash, Meta AI's Llama 4, Mistral 7B Le Chat, and High-Flyer's DeepSeek R1) using a multi-pronged approach that includes Moral Foundations Theory, a dozen established political ideology scales, and a new index of current political controversies. We find strong and consistent prioritization of liberal-leaning values, particularly care and fairness, across most models. Further analysis attributes this trend to four overlapping factors: liberal-leaning training corpora, reinforcement learning from human feedback (RLHF), the dominance of liberal frameworks in academic ethical discourse, and safety-driven fine-tuning practices. We also distinguish between political "bias" and legitimate epistemic differences, cautioning against conflating the two. A comparison of base and fine-tuned model pairs reveals that fine-tuning generally increases liberal lean, an effect confirmed through both self-report and empirical testing. We argue that this "liberal tilt" is not a programming error or the personal preference of programmers but an emergent property of training on democratic rights-focused discourse. Finally, we propose that LLMs may indirectly echo John Rawls' famous veil-of-ignorance philosophical aspiration, reflecting a moral stance unanchored to personal identity or interest. Rather than undermining democratic discourse, this pattern may offer a new lens through which to examine collective reasoning.
https://arxiv.org/abs/2507.08027
The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. In prior work [1], we established that high-level semantic reasoning can emerge in Transformers using frozen embeddings derived from the visual structure of Unicode glyphs. Here, we demonstrate that this fixed representational substrate acts as a universal "docking port," enabling two powerful and efficient scaling paradigms: seamless modular composition and progressive layer-wise growth. First, we show that specialist models trained on disparate datasets (e.g., Russian and Chinese text) can be merged into a single, more capable Mixture-of-Experts (MoE) model, post-training, with zero architectural modification. This is achieved by simply averaging their output logits. The resulting MoE model exhibits immediate performance improvements on reasoning benchmarks like MMLU, surpassing its constituent experts without catastrophic forgetting. Second, we introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time. This method demonstrates stable convergence and a clear correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD. Our findings suggest a paradigm shift from monolithic optimization towards a more biological or constructive model of AI development, where complexity is built incrementally and modules can be composed freely. This opens new avenues for resource-efficient scaling, continual learning, and a more democratized ecosystem for building powerful AI systems. We release all code and models to facilitate further research.
https://arxiv.org/abs/2507.07129
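The post-hoc merge is simple enough to show end to end: average the output logits of two experts that share the same frozen input embeddings. The tiny linear "experts" below stand in for the separately fine-tuned Transformers.

```python
# Self-contained toy version of the paper's post-hoc merge: combine two expert
# models by averaging their output logits, with no architectural changes.
# The tiny linear "experts" stand in for separately fine-tuned Transformers.
import torch
import torch.nn as nn

vocab, hidden = 1000, 64
expert_ru = nn.Linear(hidden, vocab)   # stand-in: expert trained on Russian text
expert_zh = nn.Linear(hidden, vocab)   # stand-in: expert trained on Chinese text

def merged_logits(h: torch.Tensor) -> torch.Tensor:
    """Average the experts' logits for the same (frozen, shared) input states."""
    with torch.no_grad():
        return (expert_ru(h) + expert_zh(h)) / 2

h = torch.randn(2, 16, hidden)          # (batch, seq, hidden) from the shared trunk
next_token = merged_logits(h)[:, -1].argmax(dim=-1)
print(next_token.shape)                 # torch.Size([2])
```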
Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production, and editing, yet current foundational models predominantly focus on processing images, leaving a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder; the former produces visual tokens, and the latter adapts these visual tokens to the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing, and understanding tasks.
https://arxiv.org/abs/2507.06119
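The two lightweight additions, a vision head on the MLLM and an adapter in front of the diffusion decoder, can be sketched as follows. All dimensions and the MLP shapes are assumptions chosen for illustration.

```python
# Toy skeleton of the two lightweight modules described above: a vision head on
# top of the MLLM that emits continuous visual tokens, and an adapter that maps
# them into the diffusion decoder's conditioning space. Dimensions are assumptions.
import torch
import torch.nn as nn

class VisionHead(nn.Module):
    def __init__(self, llm_dim=4096, vis_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(llm_dim, vis_dim), nn.GELU(),
                                  nn.Linear(vis_dim, vis_dim))

    def forward(self, llm_hidden):             # (batch, n_tokens, llm_dim)
        return self.proj(llm_hidden)           # continuous visual tokens

class DiffusionAdapter(nn.Module):
    def __init__(self, vis_dim=1024, cond_dim=768):
        super().__init__()
        self.proj = nn.Linear(vis_dim, cond_dim)

    def forward(self, visual_tokens):
        return self.proj(visual_tokens)        # conditioning for the video decoder

llm_hidden = torch.randn(1, 32, 4096)          # stand-in for MLLM outputs
cond = DiffusionAdapter()(VisionHead()(llm_hidden))
print(cond.shape)                              # torch.Size([1, 32, 768])
```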
The boom in modern text-to-image diffusion models has opened a new era in digital content production, demonstrating a previously unseen ability to produce photorealistic and stylistically diverse imagery from the semantics of natural-language descriptions. However, a consistent shortcoming of these models is that they cannot generate readable, meaningful, and correctly spelled text within generated images, which significantly limits their use for practical purposes such as advertising, education, and creative design. This paper introduces a new framework, Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), which extends a typical diffusion backbone with three well-designed modules. First, the model uses a dual-stream text encoder that encodes both semantic contextual information and explicit glyph representations, yielding a rich, character-aware representation of the input text. Second, a character-aware attention mechanism is proposed, together with a new attention segregation loss that constrains each character's attention distribution independently in order to avoid distortion artifacts. Lastly, GCDA includes an OCR-in-the-loop fine-tuning phase, in which a full-text perceptual loss directly optimizes the model for legibility and correct spelling. Large-scale experiments on benchmark datasets such as MARIO-10M and T2I-CompBench show that GCDA sets a new state of the art on all metrics, with better character-based text-rendering scores (Character Error Rate: 0.08 vs. 0.21 for the previous best; Word Error Rate: 0.15 vs. 0.25), stronger human-perception ratings, and comparable high-fidelity image synthesis quality (FID: 14.3).
https://arxiv.org/abs/2507.06033
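One plausible reading of the attention segregation loss is a penalty on spatial overlap between different characters' attention maps, so each glyph claims its own region. The sketch below implements that reading; it is a hedged guess, not necessarily the paper's exact formulation.

```python
# Hedged sketch of one way an "attention segregation loss" could be written:
# penalize spatial overlap between the attention maps of different characters.
# A plausible reading of the idea, not the paper's exact loss.
import torch

def attention_segregation_loss(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, n_chars, H, W) cross-attention maps, one per character."""
    b, n, h, w = attn.shape
    maps = attn.flatten(2)
    maps = maps / (maps.sum(dim=-1, keepdim=True) + 1e-8)   # normalize per character
    overlap = torch.einsum("bnd,bmd->bnm", maps, maps)      # pairwise overlap matrix
    off_diag = overlap - torch.diag_embed(torch.diagonal(overlap, dim1=1, dim2=2))
    return off_diag.sum() / (b * n * max(n - 1, 1))

attn = torch.rand(2, 5, 32, 32)
print(attention_segregation_loss(attn))
```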
Document reconstruction constitutes a significant facet of document analysis and recognition, a field that has been progressively accruing interest within the scholarly community. Many researchers employ an array of document understanding models to generate predictions on distinct subtasks and then integrate the results into a holistic document reconstruction format via heuristic principles. Nevertheless, these multi-stage methodologies are hindered by error propagation, resulting in suboptimal performance. Furthermore, contemporary studies utilize generative models to extract the logical sequence of plain text, tables, and mathematical expressions in an end-to-end process. However, this approach fails to preserve information about element layouts, which is vital for document reconstruction. To surmount these limitations, we present an autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). DREAM converts the text image into a document reconstruction sequence in a comprehensive, end-to-end process, encapsulating a broader spectrum of document element information. In addition, we establish a standardized definition of the document reconstruction task and introduce a novel Document Similarity Metric (DSM) and the DocRec1K dataset for assessing performance on the task. Empirical results substantiate that our methodology attains state-of-the-art performance in document reconstruction. Furthermore, results on a variety of subtasks, encompassing document layout analysis, text recognition, table structure recognition, formula recognition, and reading order detection, indicate that our model is competitive and compatible with various tasks.
https://arxiv.org/abs/2507.05805
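The point about preserving element layout can be made concrete with a hypothetical serialization that interleaves quantized bounding boxes with content, the kind of sequence an autoregressive model like DREAM could be trained to emit. The tag scheme and box quantization below are illustrative, not the paper's.

```python
# Illustrative sketch of a reconstruction target that keeps element layout
# alongside content, the information plain end-to-end text generation loses.
# The tag scheme and box quantization are hypothetical.
def serialize_element(kind: str, bbox, content: str, bins: int = 1000) -> str:
    """bbox is (x0, y0, x1, y1) normalized to [0, 1]; quantized to integer bins."""
    coords = " ".join(str(int(round(v * (bins - 1)))) for v in bbox)
    return f"<{kind}> <box> {coords} </box> {content} </{kind}>"

elements = [
    ("title",   (0.10, 0.05, 0.90, 0.10), "Quarterly Report"),
    ("text",    (0.10, 0.12, 0.90, 0.40), "Revenue grew by 12% ..."),
    ("table",   (0.10, 0.45, 0.90, 0.80), "<tr><td>Q1</td><td>4.2</td></tr>"),
    ("formula", (0.10, 0.82, 0.50, 0.88), "E = mc^2"),
]
sequence = " ".join(serialize_element(k, b, c) for k, b, c in elements)
print(sequence[:120], "...")
```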
This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models (VLMs), these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.
https://arxiv.org/abs/2507.05595
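A minimal usage sketch following PaddleOCR's long-standing Python interface is shown below; the exact entry points in the 3.0 release (for example for PP-StructureV3 or PP-ChatOCRv4) may differ, so treat this as an illustration rather than the definitive 3.0 API.

```python
# Minimal usage sketch following PaddleOCR's established Python interface.
# The 3.0 release may expose different entry points for the new pipelines,
# so this is an illustration, not the release's definitive API.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")                 # downloads detection/recognition models
result = ocr.ocr("sample_invoice.png")

for page in result:
    for box, (text, confidence) in page:
        print(f"{confidence:.2f}  {text}")
```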
The proliferation of AI-driven systems presents a fundamental challenge to Human-Computer Interaction (HCI) and Computer-Supported Cooperative Work (CSCW), often diminishing user agency and failing to account for value pluralism. Current approaches to value alignment, which rely on centralized, top-down definitions, lack the mechanisms for meaningful contestability. This leaves users and communities unable to challenge or shape the values embedded in the systems that govern their digital lives, creating a crisis of legitimacy and trust. This paper introduces Community-Defined AI Value Pluralism (CDAVP), a socio-technical framework that addresses this gap. It reframes the design problem from achieving a single aligned state to infrastructuring a dynamic ecosystem for value deliberation and application. At its core, CDAVP enables diverse, self-organizing communities to define and maintain explicit value profiles - rich, machine-readable representations that can encompass not only preferences but also community-specific rights and duties. These profiles are then contextually activated by the end-user, who retains ultimate control (agency) over which values guide the AI's behavior. AI applications, in turn, are designed to transparently interpret these profiles and moderate conflicts, adhering to a set of non-negotiable, democratically-legitimated meta-rules. The designer's role shifts from crafting static interfaces to becoming an architect of participatory ecosystems. We argue that infrastructuring for pluralism is a necessary pathway toward achieving robust algorithmic accountability and genuinely contestable, human-centric AI.
https://arxiv.org/abs/2507.05187
Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians' restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR's remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at this https URL.
https://arxiv.org/abs/2507.05108
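The three-stage, human-in-the-loop workflow can be sketched as a thin orchestration layer. The stage functions below are hypothetical placeholders (they exist in no library here); the sketch only shows where human review can slot in between stages.

```python
# Skeleton of the three-stage workflow described above, with hypothetical stage
# functions (locate_damage, predict_text, restore_patches are placeholders, not
# real APIs). The `review` hook marks where a human can intervene between stages.
def locate_damage(page_image):
    """Stage 1: OCR-assisted damage localization -> list of damaged regions."""
    raise NotImplementedError

def predict_text(page_image, regions):
    """Stage 2: vision-language prediction of the missing characters."""
    raise NotImplementedError

def restore_patches(page_image, regions, texts):
    """Stage 3: patch-autoregressive appearance restoration."""
    raise NotImplementedError

def restore_document(page_image, review=lambda stage, output: output):
    regions = review("damage", locate_damage(page_image))
    texts = review("text", predict_text(page_image, regions))
    return review("appearance", restore_patches(page_image, regions, texts))
```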
This article presents the experiments and results obtained by the GRESEL team in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past. Three types of experiments were conducted with the dual aim of participating in the task and enabling comparisons across different approaches. These included the use of a web-based OCR service, a traditional OCR engine, and a compact multimodal model. All experiments were run on consumer-grade hardware, which, despite lacking high-performance computing capacity, provided sufficient storage and stability. The results, while satisfactory, leave room for further improvement. Future work will focus on exploring new techniques and ideas using the Spanish-language dataset provided by the shared task, in collaboration with Biblioteca Nacional de España (BNE).
https://arxiv.org/abs/2507.04878
Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives, and artistic expression. Yet many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization, and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate external elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.
https://arxiv.org/abs/2507.04377
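The retrieval-augmented step can be illustrated with a tiny sketch: fuzzy-match strings read from the stone against external vocabularies and fold the candidates into the VLM prompt. The vocabularies, occupation codes, and prompt below are invented for illustration.

```python
# Tiny sketch of the retrieval-augmented step: fuzzy-match mentions read from the
# tombstone against external vocabularies (toponyms, occupation codes) and add the
# candidates to the VLM prompt. Vocabularies and prompt text are invented.
import difflib

TOPONYMS = ["Groningen", "Leeuwarden", "Amsterdam", "Rotterdam"]
OCCUPATIONS = {"blacksmith": "7221", "teacher": "2341", "farmer": "6111"}

def retrieve_context(raw_mention: str, k: int = 3) -> dict:
    places = difflib.get_close_matches(raw_mention, TOPONYMS, n=k, cutoff=0.6)
    jobs = difflib.get_close_matches(raw_mention, list(OCCUPATIONS), n=k, cutoff=0.6)
    return {"toponym_candidates": places,
            "occupation_candidates": {j: OCCUPATIONS[j] for j in jobs}}

def build_prompt(inscription: str) -> str:
    ctx = retrieve_context(inscription.split()[-1])
    return (f"Inscription: {inscription}\n"
            f"Candidate toponyms: {ctx['toponym_candidates']}\n"
            f"Candidate occupation codes: {ctx['occupation_candidates']}\n"
            "Produce a structured Tombstone Meaning Representation.")

print(build_prompt("Here lies Jan, blacksmith of Groninge"))
```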