Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.
https://arxiv.org/abs/2601.09421
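The checkpoint-level alignment the abstract above claims between BabyLMs and BERT can be illustrated with a toy computation. The bias values below are invented for illustration only (not taken from the paper); the point is just how such a correlation would be measured:

```python
import numpy as np

# Hypothetical bias-metric trajectories (e.g., an intrinsic bias score)
# recorded at matching training checkpoints; values are illustrative only.
babylm_bias = np.array([0.02, 0.11, 0.19, 0.24, 0.27, 0.29])
bert_bias   = np.array([0.03, 0.09, 0.21, 0.26, 0.30, 0.31])

# Pearson correlation across checkpoints: a high value would indicate
# that the proxy model tracks the larger model's bias formation.
r = np.corrcoef(babylm_bias, bert_bias)[0, 1]
print(f"checkpoint-wise correlation: {r:.3f}")
```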
We describe a novel system, CSQL, which automatically converts a collection of unstructured text documents into an SQL-queryable causal database (CDB). A CDB differs from a traditional DB: it is designed to answer "why'' questions via causal interventions and structured causal queries. CSQL builds on our earlier system, DEMOCRITUS, which converts documents into thousands of local causal models derived from causal discourse. Unlike RAG-based systems or knowledge-graph-based approaches, CSQL supports causal analysis over document collections rather than purely associative retrieval. For example, given an article on the origins of human bipedal walking, CSQL enables queries such as: "What are the strongest causal influences on bipedalism?'' or "Which variables act as causal hubs with the largest downstream influence?'' Beyond single-document case studies, we show that CSQL can also ingest RAG/IE-compiled causal corpora at scale by compiling the Testing Causal Claims (TCC) dataset of economics papers into a causal database containing 265,656 claim instances spanning 45,319 papers, 44 years, and 1,575 reported method strings, thereby enabling corpus-level causal queries and longitudinal analyses in CSQL. Viewed abstractly, CSQL functions as a compiler from unstructured documents into a causal database equipped with a principled algebra of queries, and can be applied broadly across domains ranging from business and the humanities to the sciences.
https://arxiv.org/abs/2601.08109
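The kind of causal query described above can be sketched against a hypothetical relational layout. The table schema, column names, and strengths below are assumptions for illustration; the abstract does not specify CSQL's actual schema or query algebra:

```python
import sqlite3

# A toy causal database (CDB) with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE claims (
    cause TEXT, effect TEXT, strength REAL, paper_id TEXT)""")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [("energy efficiency", "bipedalism", 0.8, "p1"),
     ("carrying behavior", "bipedalism", 0.6, "p2"),
     ("habitat change",    "bipedalism", 0.7, "p1"),
     ("bipedalism",        "tool use",   0.5, "p3")])

# "What are the strongest causal influences on bipedalism?"
rows = conn.execute(
    """SELECT cause, MAX(strength) AS s FROM claims
       WHERE effect = 'bipedalism'
       GROUP BY cause ORDER BY s DESC""").fetchall()
print(rows)  # strongest influences first
```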
Automatic License Plate Recognition is a frequent research topic due to its wide-ranging practical applications. While recent studies use synthetic images to improve License Plate Recognition (LPR) results, there remain several limitations in these efforts. This work addresses these constraints by comprehensively exploring the integration of real and synthetic data to enhance LPR performance. We subject 16 Optical Character Recognition (OCR) models to a benchmarking process involving 12 public datasets acquired from various regions. Several key findings emerge from our investigation. First, incorporating synthetic data at scale substantially boosts model performance in both intra- and cross-dataset scenarios. We examine three distinct methodologies for generating synthetic data: template-based generation, character permutation, and utilizing a Generative Adversarial Network (GAN) model, each contributing significantly to performance enhancement. The combined use of these methodologies demonstrates a notable synergistic effect, leading to end-to-end results that surpass those reached by state-of-the-art methods and established commercial systems. Our experiments also underscore the efficacy of synthetic data in mitigating challenges posed by limited training data, enabling remarkable results to be achieved even with small fractions of the original training data. Finally, we investigate the trade-off between accuracy and speed among different models, identifying those that strike the optimal balance in both intra-dataset and cross-dataset settings.
https://arxiv.org/abs/2601.07671
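Of the three generation strategies named above, character permutation is the simplest to sketch. The helper below is a hedged illustration of the string-level step only; the paper's generator would additionally render the permuted strings onto plate images:

```python
import random

def permute_plate(plate: str, n: int, seed: int = 0) -> list[str]:
    """Generate synthetic plate strings by permuting the characters of a
    real plate while preserving its letter/digit pattern (a simplified
    take on the 'character permutation' strategy)."""
    rng = random.Random(seed)
    letters = [c for c in plate if c.isalpha()]
    digits = [c for c in plate if c.isdigit()]
    out = set()
    for _ in range(100 * n):          # bounded retry loop
        if len(out) >= n:
            break
        rng.shuffle(letters)
        rng.shuffle(digits)
        it_l, it_d = iter(letters), iter(digits)
        out.add("".join(next(it_l) if c.isalpha() else next(it_d)
                        for c in plate))
    return sorted(out)

print(permute_plate("ABC1234", 5))
```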
Document layout analysis aims to detect and categorize structural elements (e.g., titles, tables, figures) in scanned or digital documents. Popular methods often rely on high-quality Optical Character Recognition (OCR) to merge visual features with extracted text. This dependency introduces two major drawbacks: propagation of text recognition errors and substantial computational overhead, limiting the robustness and practical applicability of multimodal approaches. In contrast to the prevailing multimodal trend, we argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure. To this end, we propose PARL (Position-Aware Relation Learning Network), a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure. Specifically, we first introduce a Bidirectional Spatial Position-Guided Deformable Attention module to embed explicit positional dependencies among layout elements directly into visual features. Second, we design a Graph Refinement Classifier (GRC) to refine predictions by modeling contextual relationships through a dynamically constructed layout graph. Extensive experiments show PARL achieves state-of-the-art results. It establishes a new benchmark for vision-only methods on DocLayNet and, notably, surpasses even strong multimodal models on M6Doc. Crucially, PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models (256M), demonstrating that sophisticated visual structure modeling can be both more efficient and robust than multimodal fusion.
https://arxiv.org/abs/2601.07620
Multimodal fake news detection is crucial for mitigating adversarial misinformation. Existing methods, relying on static fusion or LLMs, face computational redundancy and hallucination risks due to weak visual foundations. To address this, we propose DIVER (Dynamic Iterative Visual Evidence Reasoning), a framework grounded in a progressive, evidence-driven reasoning paradigm. DIVER first establishes a strong text-based baseline through language analysis, leveraging intra-modal consistency to filter unreliable or hallucinated claims. Only when textual evidence is insufficient does the framework introduce visual information, where inter-modal alignment verification adaptively determines whether deeper visual inspection is necessary. For samples exhibiting significant cross-modal semantic discrepancies, DIVER selectively invokes fine-grained visual tools (e.g., OCR and dense captioning) to extract task-relevant evidence, which is iteratively aggregated via uncertainty-aware fusion to refine multimodal reasoning. Experiments on Weibo, Weibo21, and GossipCop demonstrate that DIVER outperforms state-of-the-art baselines by an average of 2.72%, while also improving inference efficiency with a 4.12 s reduction in latency.
https://arxiv.org/abs/2601.07178
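DIVER's progressive, evidence-driven routing can be caricatured as a threshold cascade. The thresholds and stage labels below are illustrative assumptions, not the framework's actual decision rule:

```python
def diver_route(text_conf: float, alignment: float,
                t_text: float = 0.85, t_align: float = 0.6) -> str:
    """Sketch of progressive routing (thresholds are illustrative):
    1) trust a confident text-only verdict;
    2) otherwise check inter-modal alignment;
    3) on a large cross-modal discrepancy, escalate to fine-grained
       visual tools (OCR, dense captioning)."""
    if text_conf >= t_text:
        return "text-only verdict"
    if alignment >= t_align:
        return "shallow visual check"
    return "invoke visual tools"

print(diver_route(0.9, 0.7))   # confident text evidence
print(diver_route(0.5, 0.3))   # large cross-modal discrepancy
```

The design point is that the expensive visual tools sit behind two cheap gates, which is where the latency saving would come from.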
Biomedical Question Answering systems play a critical role in processing complex medical queries, yet they often struggle with the intricate nature of medical data and the demand for multi-hop reasoning. In this paper, we propose a model designed to effectively address both direct and sequential questions. While sequential questions are decomposed into a chain of sub-questions to perform reasoning across a chain of steps, direct questions are processed directly to ensure efficiency and minimise processing overhead. Additionally, we leverage multi-source information retrieval and in-context learning to provide rich, relevant context for generating answers. We evaluated our model on the BioCreative IX - MedHopQA Shared Task datasets. Our approach achieves an Exact Match score of 0.84, ranking second on the current leaderboard. These results highlight the model's capability to meet the challenges of Biomedical Question Answering, offering a versatile solution for advancing medical research and practice.
https://arxiv.org/abs/2601.06974
Multi-objective alignment for text-to-image generation is commonly implemented via static linear scalarization, but fixed weights often fail under heterogeneous rewards, leading to optimization imbalance where models overfit high-variance, high-responsiveness objectives (e.g., OCR) while under-optimizing perceptual goals. We identify two mechanistic causes: variance hijacking, where reward dispersion induces implicit reweighting that dominates the normalized training signal, and gradient conflicts, where competing objectives produce opposing update directions and trigger seesaw-like oscillations. We propose APEX (Adaptive Priority-based Efficient X-objective Alignment), which stabilizes heterogeneous rewards with Dual-Stage Adaptive Normalization and dynamically schedules objectives via P^3 Adaptive Priorities that combine learning potential, conflict penalty, and progress need. On Stable Diffusion 3.5, APEX achieves improved Pareto trade-offs across four heterogeneous objectives, with balanced gains of +1.31 PickScore, +0.35 DeQA, and +0.53 Aesthetics while maintaining competitive OCR accuracy, mitigating the instability of multi-objective alignment.
https://arxiv.org/abs/2601.06574
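The variance-hijacking problem and a two-stage normalization response can be sketched as follows. The abstract names Dual-Stage Adaptive Normalization but not its exact form, so the standardize-then-squash scheme below is an assumption:

```python
import numpy as np

def dual_stage_normalize(rewards: np.ndarray) -> np.ndarray:
    """Illustrative two-stage normalization for heterogeneous rewards:
    stage 1: per-objective standardization so that no objective's raw
             variance dominates the summed training signal;
    stage 2: bounded squashing (tanh) to damp outlier batches."""
    mu = rewards.mean(axis=0, keepdims=True)
    sd = rewards.std(axis=0, keepdims=True) + 1e-8
    return np.tanh((rewards - mu) / sd)

# Batch of 4 samples x 2 objectives with very different scales
# (e.g., an OCR reward vs. a perceptual/aesthetic reward).
r = np.array([[90.0, 0.2], [10.0, 0.3], [50.0, 0.25], [70.0, 0.22]])
z = dual_stage_normalize(r)
print(z.std(axis=0))  # per-objective spreads are now comparable
```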
The strength of democracy lies in the free and equal exchange of diverse viewpoints. Living up to this ideal at scale faces inherent tensions: broad participation, meaningful deliberation, and political equality often trade off with one another (Fishkin, 2011). We ask whether and how artificial intelligence (AI) could help navigate this "trilemma" by engaging with a recent example of a large language model (LLM)-based system designed to help people with diverse viewpoints find common ground (Tessler, Bakker, et al., 2024). Here, we explore the implications of the introduction of LLMs into deliberation augmentation tools, examining their potential to enhance participation through scalability, improve political equality via fair mediation, and foster meaningful deliberation by, for example, surfacing trustworthy information. We also point to key challenges that remain. Ultimately, a range of empirical, technical, and theoretical advancements are needed to fully realize the promise of AI-mediated deliberation for enhancing citizen engagement and strengthening democratic deliberation.
https://arxiv.org/abs/2601.05904
Municipal meeting minutes record key decisions in local democratic processes. Unlike parliamentary proceedings, which typically adhere to standardized formats, they encode voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, posing significant challenges for automated extraction. In this paper, we introduce VotIE (Voting Information Extraction), a new information extraction task aimed at identifying structured voting events in narrative deliberative records, and establish the first benchmark for this task using Portuguese municipal minutes, building on the recently introduced CitiLink corpus. Our experiments yield two key findings. First, under standard in-domain evaluation, fine-tuned encoders, specifically XLM-R-CRF, achieve the strongest performance, reaching 93.2% macro F1, outperforming generative approaches. Second, in a cross-municipality setting that evaluates transfer to unseen administrative contexts, these models suffer substantial performance degradation, whereas few-shot LLMs demonstrate greater robustness, with significantly smaller declines in performance. Despite this generalization advantage, the high computational cost of generative models currently constrains their practicality. As a result, lightweight fine-tuned encoders remain a more practical option for large-scale, real-world deployment. To support reproducible research in administrative NLP, we publicly release our benchmark, trained models, and evaluation framework.
https://arxiv.org/abs/2601.03997
DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at this https URL.
https://arxiv.org/abs/2601.03714
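Word-level semantic corruption of the kind used in the probe is straightforward to reproduce: shuffling word order keeps every character renderable while breaking the n-gram statistics a language prior relies on. A minimal sketch (the study's corruption protocol may differ in detail):

```python
import random

def corrupt_words(text: str, seed: int = 42) -> str:
    """Word-level semantic corruption: shuffle word order so that an
    OCR model can no longer lean on language priors, isolating its
    intrinsic character-recognition ability."""
    words = text.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

src = "the quick brown fox jumps over the lazy dog"
out = corrupt_words(src)
print(out)
# Same multiset of words, different order:
assert sorted(out.split()) == sorted(src.split())
```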
Engineering education faces a double disruption: traditional apprenticeship models that cultivated judgment and tacit skill are eroding, just as generative AI emerges as an informal coaching partner. This convergence rekindles long-standing questions in the philosophy of AI and cognition about the limits of computation, the nature of embodied rationality, and the distinction between information processing and wisdom. Building on this rich intellectual tradition, this paper examines whether AI chatbots can provide coaching that fosters mastery rather than merely delivering information. We synthesize critical perspectives from decades of scholarship on expertise, tacit knowledge, and human-machine interaction, situating them within the context of contemporary AI-driven education. Empirically, we report findings from a mixed-methods study (N = 75 students, N = 7 faculty) exploring the use of a coaching chatbot in engineering education. Results reveal a consistent boundary: participants accept AI for technical problem solving (convergent tasks; M = 3.84 on a 1-5 Likert scale) but remain skeptical of its capacity for moral, emotional, and contextual judgment (divergent tasks). Faculty express stronger concerns over risk (M = 4.71 vs. M = 4.14, p = 0.003), and privacy emerges as a key requirement, with 64-71 percent of participants demanding strict confidentiality. Our findings suggest that while generative AI can democratize access to cognitive and procedural support, it cannot replicate the embodied, value-laden dimensions of human mentorship. We propose a multiplex coaching framework that integrates human wisdom within expert-in-the-loop models, preserving the depth of apprenticeship while leveraging AI scalability to enrich the next generation of engineering education.
https://arxiv.org/abs/2601.03693
Binarization is a popular first step towards text extraction in historical artifacts. Stone inscription images pose severe challenges for binarization due to poor contrast between etched characters and the stone background, non-uniform surface degradation, distracting artifacts, and highly variable text density and layouts. These conditions frequently cause existing binarization techniques to fail and struggle to isolate coherent character regions. Many approaches sub-divide the image into patches to improve text-fragment resolution and, in turn, binarization performance. With this in mind, we present a robust and adaptive patching strategy to binarize challenging Indic inscriptions. The patches from our approach are used to train an Attention U-Net for binarization. The attention mechanism allows the model to focus on subtle structural cues, while our dynamic sampling and patch selection method ensures that the model learns to overcome surface noise and layout irregularities. We also introduce a carefully annotated, pixel-precise dataset of Indic stone inscriptions at the character-fragment level. We demonstrate that our novel patching mechanism significantly boosts binarization performance across classical and deep learning baselines. Despite training only on a single-script Indic dataset, our model exhibits strong zero-shot generalization to other Indic and non-Indic scripts, highlighting its robustness and script-agnostic generalization capabilities. By producing clean, structured representations of inscription content, our method lays the foundation for downstream tasks such as script identification, OCR, and historical text analysis. Project page: this https URL
https://arxiv.org/abs/2601.03609
Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models' ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
https://arxiv.org/abs/2601.03400
Multimodal large language models (MLLMs) typically rely on a single late-layer feature from a frozen vision encoder, leaving the encoder's rich hierarchy of visual cues under-utilized. MLLMs still suffer from visually ungrounded hallucinations, often relying on language priors rather than image evidence. While many prior mitigation strategies operate on the text side, they leave the visual representation unchanged and do not exploit the rich hierarchy of features encoded across vision layers. Existing multi-layer fusion methods partially address this limitation but remain static, applying the same layer mixture regardless of the query. In this work, we introduce TGIF (Text-Guided Inter-layer Fusion), a lightweight module that treats encoder layers as depth-wise "experts" and predicts a prompt-dependent fusion of visual features. TGIF follows the principle of direct external fusion, requires no vision-encoder updates, and adds minimal overhead. Integrated into LLaVA-1.5-7B, TGIF provides consistent improvements across hallucination, OCR, and VQA benchmarks, while preserving or improving performance on ScienceQA, GQA, and MMBench. These results suggest that query-conditioned, hierarchy-aware fusion is an effective way to strengthen visual grounding and reduce hallucination in modern MLLMs.
https://arxiv.org/abs/2601.03100
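The idea of a prompt-dependent mixture over encoder layers can be sketched in a few lines. The gating network, the shapes, and the softmax mixing below are assumptions for illustration, not TGIF's exact architecture:

```python
import numpy as np

def text_guided_fusion(layer_feats: np.ndarray, text_emb: np.ndarray,
                       W: np.ndarray) -> np.ndarray:
    """Minimal sketch of prompt-conditioned inter-layer fusion: a
    linear gate maps the text embedding to one logit per encoder
    layer, and the softmax weights mix the per-layer features
    (treating layers as depth-wise 'experts')."""
    logits = W @ text_emb                      # (L,)
    w = np.exp(logits - logits.max())
    w /= w.sum()                               # softmax over layers
    # layer_feats: (L, tokens, dim) -> weighted sum over the L axis
    return np.tensordot(w, layer_feats, axes=1)

rng = np.random.default_rng(0)
L, T, D = 4, 5, 8                              # layers, tokens, feat dim
feats = rng.normal(size=(L, T, D))
fused = text_guided_fusion(feats, rng.normal(size=16),
                           rng.normal(size=(L, 16)))
print(fused.shape)  # (5, 8)
```

Because the mixture weights depend on the prompt, different queries can emphasize different encoder depths, which is the contrast with static multi-layer fusion drawn in the abstract.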
Bahnar, a minority language spoken across Vietnam, Cambodia, and Laos, faces significant preservation challenges due to limited research and data availability. This study addresses the critical need for accurate digitization of Bahnar language documents through optical character recognition (OCR) technology. Digitizing scanned paper documents poses significant challenges, as degraded image quality from broken or blurred areas introduces considerable OCR errors that compromise information retrieval systems. We propose a comprehensive approach combining advanced table and non-table detection techniques with probability-based post-processing heuristics to enhance recognition accuracy. Our method first applies detection algorithms to improve input data quality, then employs probabilistic error correction on OCR output. Experimental results indicate a substantial improvement, with recognition accuracy increasing from 72.86% to 79.26%. This work contributes valuable resources for Bahnar language preservation and provides a framework applicable to other minority language digitization efforts.
https://arxiv.org/abs/2601.02965
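Probability-based post-correction of OCR output can be sketched as a lexicon lookup within a small edit radius. The lexicon and its probabilities below are hypothetical, and the paper's heuristics are richer than this:

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via rolling dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(token: str, lexicon: dict[str, float], max_d: int = 2) -> str:
    """Pick the most probable lexicon entry within a small edit radius;
    fall back to the raw OCR token when nothing is close enough."""
    candidates = [(p, w) for w, p in lexicon.items()
                  if edit_distance(token, w) <= max_d]
    return max(candidates)[1] if candidates else token

# Hypothetical unigram probabilities, not real Bahnar corpus statistics.
lexicon = {"hlabar": 0.02, "bahnar": 0.30, "banar": 0.10}
print(correct("bahnnar", lexicon))  # -> bahnar
```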
Rapid motorization in emerging economies such as India has created severe enforcement asymmetries, with over 11 million recorded violations in 2023 against a human policing density of roughly one officer per 4,000 vehicles. Traditional surveillance and manual ticketing cannot scale to this magnitude, motivating the need for an autonomous, cooperative, and energy-efficient edge-AI perception infrastructure. This paper presents a real-time roadside perception node for multi-class traffic-violation analytics and safety-event dissemination within a connected and intelligent vehicle ecosystem. The node integrates YOLOv8-Nano for high-accuracy multi-object detection, DeepSORT for temporally consistent vehicle tracking, and a rule-guided OCR post-processing engine capable of recognizing degraded or multilingual license plates compliant with MoRTH AIS 159 and ISO 7591 visual-contrast standards. Deployed on an NVIDIA Jetson Nano with a 128-core Maxwell GPU and optimized via TensorRT FP16 quantization, the system sustains 28 to 30 frames per second of inference at 9.6 W, achieving 97.7% violation-detection accuracy and 84.9% OCR precision across five violation classes, namely signal jumping, zebra-crossing breach, wrong-way driving, illegal U-turns, and speeding, without manual region-of-interest calibration. Comparative benchmarking against YOLOv4-Tiny, PP-YOLOE-S, and NanoDet-Plus demonstrates a 10.7% mean-average-precision gain and a 1.4x accuracy-per-watt improvement. Beyond enforcement, the node publishes standardized CAM- and DENM-type safety events to connected vehicles and intelligent-transportation-system backends via V2X protocols, demonstrating that roadside edge-AI analytics can augment cooperative perception and proactive road-safety management within the IEEE Intelligent Vehicles ecosystem.
https://arxiv.org/abs/2601.07845
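One small slice of rule-guided OCR post-processing is position-class coercion on plate strings. The pattern and confusion map below are illustrative assumptions (the pattern loosely follows Indian plates such as "MH12AB1234"), not the deployed rules:

```python
# Common OCR confusions, resolved by the position's expected class.
TO_DIGIT = {"O": "0", "I": "1", "Z": "2", "S": "5", "B": "8"}
TO_ALPHA = {v: k for k, v in TO_DIGIT.items()}

def normalize_plate(raw: str, pattern: str = "LLDDLLDDDD") -> str:
    """Rule-guided correction sketch: coerce each character toward the
    letter (L) or digit (D) class its position demands."""
    out = []
    for ch, cls in zip(raw.upper(), pattern):
        if cls == "D" and ch.isalpha():
            ch = TO_DIGIT.get(ch, ch)
        elif cls == "L" and ch.isdigit():
            ch = TO_ALPHA.get(ch, ch)
        out.append(ch)
    return "".join(out)

print(normalize_plate("MH1ZAB1O34"))  # -> MH12AB1034
```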
Optimization is fundamental across numerous disciplines, typically following an iterative process of refining an initial solution to enhance performance. This principle is equally critical in prompt engineering, where designing effective prompts for large language models constitutes a complex optimization challenge. A structured optimization approach requires automated or semi-automated procedures to develop improved prompts, thereby reducing manual effort, improving performance, and yielding an interpretable process. However, current prompt optimization methods often induce prompt drift, where new prompts fix prior failures but impair performance on previously successful tasks. Additionally, generating prompts from scratch can compromise interpretability. To address these limitations, this study proposes the Hierarchical Attribution Prompt Optimization (HAPO) framework, which introduces three innovations: (1) a dynamic attribution mechanism targeting error patterns in training data and prompting history, (2) semantic-unit optimization for editing functional prompt segments, and (3) multimodal-friendly progression supporting both end-to-end LLM and LLM-MLLM workflows. Applied in contexts like single/multi-image QA (e.g., OCRV2) and complex task analysis (e.g., BBH), HAPO demonstrates enhanced optimization efficiency, outperforming comparable automated prompt optimization methods and establishing an extensible paradigm for scalable prompt engineering.
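The prompt-drift guard that HAPO addresses can be illustrated with a toy sketch. The trivial "judge", the case data, and all names below are our own illustration, not the HAPO code: an edit to a semantic-unit prompt is accepted only if it improves accuracy without regressing any previously-solved case.

```python
# Toy sketch of semantic-unit prompts plus a drift check: an edit that fixes
# new failures is rejected if it breaks any case the old prompt already solved.

def accuracy(prompt_units, cases, judge):
    """Score a prompt (list of semantic units) over (input, expected) cases."""
    return sum(judge(prompt_units, x) == y for x, y in cases) / len(cases)

def accept_edit(old_units, new_units, cases, judge):
    """Accept an edit only if no previously-solved case regresses (no drift)."""
    solved_before = [i for i, (x, y) in enumerate(cases)
                     if judge(old_units, x) == y]
    regressed = [i for i in solved_before
                 if judge(new_units, cases[i][0]) != cases[i][1]]
    return (not regressed
            and accuracy(new_units, cases, judge) > accuracy(old_units, cases, judge))

# Stand-in "model": answers correctly iff the prompt mentions the task keyword.
judge = lambda units, x: x if any(x in u for u in units) else "?"
cases = [("sum", "sum"), ("sort", "sort")]
old = ["You add numbers (sum)."]
new = ["You add numbers (sum).", "You also sort lists (sort)."]
print(accept_edit(old, new, cases, judge))  # True: fixes 'sort', keeps 'sum'
```

HAPO's dynamic attribution additionally chooses *which* unit to edit from error patterns in the training data and prompting history; here the edit is supplied by hand.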
https://arxiv.org/abs/2601.02683
This survey has provided a systematic overview of the emerging field of LLM-enabled compilation by addressing several key research questions. We first answered how LLMs are being integrated by proposing a comprehensive, multi-dimensional taxonomy that categorizes works based on their Design Philosophy (Selector, Translator, Generator), LLM Methodology, their operational Level of Code Abstraction, and the specific Task Type they address. In answering what advancements these approaches offer, we identified three primary benefits: the democratization of compiler development, the discovery of novel optimization strategies, and the broadening of the compiler's traditional scope. Finally, in addressing the field's challenges and opportunities, we highlighted the critical hurdles of ensuring correctness and achieving scalability, while identifying the development of hybrid systems as the most promising path forward. By providing these answers, this survey serves as a foundational roadmap for researchers and practitioners, charting the course for a new generation of LLM-powered, intelligent, adaptive and synergistic compilation tools.
https://arxiv.org/abs/2601.02045
Recent advances adapt online reinforcement learning (RL) techniques from LLMs to text-to-image rectified flow diffusion models for reward alignment. Group-level rewards successfully align the model with the targeted reward, but the approach faces challenges including low efficiency, dependence on stochastic samplers, and reward hacking. The root issue is that rectified flow models differ fundamentally from LLMs: 1) for efficiency, online image sampling is far slower and dominates training time; 2) for stochasticity, rectified flow is deterministic once the initial noise is fixed. Motivated by these problems and inspired by the effectiveness of group-level rewards in LLMs, we design Group-level Direct Reward Optimization (GDRO), a new post-training paradigm for group-level reward alignment that exploits the characteristics of rectified flow models. Through rigorous theoretical analysis, we show that GDRO supports fully offline training, avoiding the large time cost of image rollout sampling, and is diffusion-sampler-independent, eliminating the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward-hacking trap that can mislead evaluation and account for it with a corrected score that considers both the original evaluation reward and the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the diffusion model's reward score through group-wise offline optimization on the OCR and GenEval tasks, while showing strong stability and robustness in mitigating reward hacking.
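The group-level reward signal that GDRO builds on can be illustrated numerically: each rollout's training weight is its reward centred and scaled within its group, computable entirely offline from pre-sampled rollouts. This is a GRPO-style sketch under our own assumptions, not the paper's exact objective.

```python
# Group-wise reward centring/scaling: higher-than-group-average rollouts get
# positive weight, lower ones negative, with no online sampling required.
import statistics

def group_advantages(rewards, eps=1e-6):
    """Centre and scale rewards within one group of rollouts for a prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four offline image rollouts scored by e.g. an OCR reward:
adv = group_advantages([0.9, 0.7, 0.1, 0.3])
print([round(a, 2) for a in adv])  # [1.26, 0.63, -1.26, -0.63]
```

Because the weights depend only on stored (rollout, reward) pairs, no stochastic sampler is needed at training time, matching the abstract's point that rectified flow is deterministic given the initial noise.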
https://arxiv.org/abs/2601.02036
Thyroid cancer is the most common endocrine malignancy, and its incidence is rising globally. While ultrasound is the preferred imaging modality for detecting thyroid nodules, its diagnostic accuracy is often limited by challenges such as low image contrast and blurred nodule boundaries. To address these issues, we propose Nodule-DETR, a novel detection transformer (DETR) architecture designed for robust thyroid nodule detection in ultrasound images. Nodule-DETR introduces three key innovations: a Multi-Spectral Frequency-domain Channel Attention (MSFCA) module that leverages frequency analysis to enhance features of low-contrast nodules; a Hierarchical Feature Fusion (HFF) module for efficient multi-scale integration; and Multi-Scale Deformable Attention (MSDA) to flexibly capture small and irregularly shaped nodules. We conducted extensive experiments on a clinical dataset of real-world thyroid ultrasound images. The results demonstrate that Nodule-DETR achieves state-of-the-art performance, outperforming the baseline model by a significant margin of 0.149 in mAP@0.5:0.95. The superior accuracy of Nodule-DETR highlights its significant potential for clinical application as an effective tool in computer-aided thyroid diagnosis. The code for this work is available at this https URL.
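The frequency-domain channel-attention idea behind MSFCA can be sketched with a plain 2-D DCT: each channel is summarised by a few frequency coefficients instead of average pooling alone, so a high-contrast pattern that averages to zero still produces a strong gating signal. The frequency set, the sigmoid gate, and all names below are our illustrative choices, not the paper's MSFCA.

```python
# Frequency-based channel attention sketch: summarise each channel with a few
# 2-D DCT-II coefficients, then squash the sum into a per-channel weight.
import math

def dct2_coeff(ch, u, v):
    """One 2-D DCT-II coefficient of a channel (H x W list of lists)."""
    h, w = len(ch), len(ch[0])
    return sum(ch[i][j]
               * math.cos(math.pi * u * (2 * i + 1) / (2 * h))
               * math.cos(math.pi * v * (2 * j + 1) / (2 * w))
               for i in range(h) for j in range(w))

def freq_channel_weights(feat, freqs=((0, 0), (1, 1))):
    """Per-channel attention weights from a small set of frequency components."""
    scores = [sum(dct2_coeff(ch, u, v) for u, v in freqs) for ch in feat]
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]  # sigmoid gate

# Two 2x2 channels: a nearly flat one and a high-contrast checkerboard. The
# (1, 1) coefficient lets the checkerboard channel earn the larger weight even
# though plain average pooling would score it zero.
feat = [[[0.1, 0.1], [0.1, 0.1]],
        [[1.0, -1.0], [-1.0, 1.0]]]
w = freq_channel_weights(feat)
```

Note how the (0, 0) coefficient is exactly (unnormalised) average pooling, so this family of attentions strictly generalises the usual squeeze step, which is what makes it attractive for low-contrast ultrasound features.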
https://arxiv.org/abs/2601.01908