Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics and, despite its growing popularity, remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
https://arxiv.org/abs/2512.11635
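For readers unfamiliar with the library, a minimal BERTopic sketch of the kind of pipeline described above might look as follows; the embedding model, the loader functions, and the number of time bins are placeholder assumptions, not the authors' exact configuration.

```python
# Minimal BERTopic sketch (assumptions: embedding model choice, hypothetical
# loaders `load_articles`/`load_years`, and the number of time bins).
from bertopic import BERTopic

docs = load_articles()        # hypothetical: list of OCR'd article texts
timestamps = load_years()     # hypothetical: one publication year per article

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", verbose=True)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())        # topic sizes and keyword lists

# trace temporal evolution of topics, e.g. roughly one bin per five-year period
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=13)
print(topics_over_time.head())
```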
Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
https://arxiv.org/abs/2512.08789
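To make the high-frequency amplification idea concrete, a toy block of this kind is sketched below: a crude low-pass filter separates out the high-frequency residual, which is re-added with a learnable gain. This is an illustration of the general mechanism only, not the paper's HFAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqAmplify(nn.Module):
    """Toy high-frequency amplification block (illustrative only, not the
    paper's HFAM): low-pass the feature map with average pooling, treat the
    residual as the high-frequency part, and re-add it with a learnable gain."""
    def __init__(self, channels: int):
        super().__init__()
        # per-channel gain predicted from the high-frequency content itself
        self.gain = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                 # x: (B, C, H, W)
        low = F.avg_pool2d(x, 3, stride=1, padding=1)     # crude low-pass filter
        high = x - low                                    # high-frequency residual
        return x + self.gain(high) * high                 # adaptively amplified details

feat = torch.randn(1, 32, 64, 64)
out = HighFreqAmplify(32)(feat)
print(out.shape)   # torch.Size([1, 32, 64, 64])
```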
Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model's output probabilities upon its removal. Using this metric, our layer-wise analysis of visual token information reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) exploit visual tokens at deeper layers than weaker models (e.g., LLaVA-1.5). Based on these findings, we show that simple random pruning in deep layers effectively balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at this https URL.
https://arxiv.org/abs/2512.07580
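As a rough illustration of the measurement described above, the sketch below compares the model's next-token distribution with and without a given visual token; `model` is assumed to be any Hugging Face-style causal VLM that accepts `inputs_embeds` and an attention mask, and the exact metric in the paper may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_information(model, inputs_embeds, attention_mask, token_idx):
    """Illustrative version of the paper's idea (exact metric may differ):
    score a visual token by how much the next-token distribution changes
    when that token is masked out of attention (an approximation of removal)."""
    full = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
    p_full = F.softmax(full.logits[:, -1, :], dim=-1)     # last-position distribution

    dropped_mask = attention_mask.clone()
    dropped_mask[:, token_idx] = 0                        # hide the visual token
    drop = model(inputs_embeds=inputs_embeds, attention_mask=dropped_mask)
    p_drop = F.softmax(drop.logits[:, -1, :], dim=-1)

    # KL(p_full || p_drop): larger value = more information carried by the token
    return F.kl_div(p_drop.log(), p_full, reduction="batchmean").item()
```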
Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, particularly struggling with dense scenes and small object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on RefCOCO series and 19% on Flickr30k entities.
https://arxiv.org/abs/2512.06663
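The classify, count, and ground decomposition can be pictured as three chained prompts. The sketch below is only a schematic of that idea: `vlm_chat(image, prompt) -> str` is a hypothetical wrapper around any LVLM, and the prompt wording is not the authors' template.

```python
def cot4det(image, vlm_chat):
    """Illustrative classify -> count -> ground decomposition in the spirit of
    CoT4Det; `vlm_chat` is a hypothetical VLM wrapper."""
    # Step 1: classification - which object categories are present?
    categories = vlm_chat(image, "List every object category visible in the image, comma-separated.")

    # Step 2: counting - how many instances per category?
    counts = vlm_chat(image, f"For each of these categories: {categories}, state how many instances appear.")

    # Step 3: grounding - localize each instance as a bounding box
    boxes = vlm_chat(
        image,
        f"Given these categories and counts: {counts}, output one line per instance as "
        "'label: [x1, y1, x2, y2]' in pixel coordinates."
    )
    return boxes
```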
The rise of Large Vision-Language Models (LVLMs) has unlocked new possibilities for seamlessly integrating visual and textual information. However, their ability to interpret cartographic maps remains largely unexplored. In this paper, we introduce CartoMapQA, a benchmark specifically designed to evaluate LVLMs' understanding of cartographic maps through question-answering tasks. The dataset includes over 2000 samples, each composed of a cartographic map, a question (with open-ended or multiple-choice answers), and a ground-truth answer. These tasks span key low-, mid-, and high-level map interpretation skills, including symbol recognition, embedded information extraction, scale interpretation, and route-based reasoning. Our evaluation of both open-source and proprietary LVLMs reveals persistent challenges: models frequently struggle with map-specific semantics, exhibit limited geospatial reasoning, and are prone to Optical Character Recognition (OCR)-related errors. By isolating these weaknesses, CartoMapQA offers a valuable tool for guiding future improvements in LVLM architectures. Ultimately, it supports the development of models better equipped for real-world applications that depend on robust and reliable map understanding, such as navigation, geographic search, and urban planning. Our source code and data are openly available to the research community at: this https URL
https://arxiv.org/abs/2512.03558
Despite considerable progress in handwritten text recognition, paragraph-level handwritten text recognition, especially in low-resource languages such as Hindi, Urdu, and similar scripts, remains a challenging problem. These languages, often lacking comprehensive linguistic resources, require special attention to develop robust systems for accurate optical character recognition (OCR). This paper introduces BharatOCR, a novel segmentation-free system for paragraph-level handwritten Hindi and Urdu text recognition. We propose a ViT-Transformer Decoder-LM architecture for handwritten text recognition, in which a Vision Transformer (ViT) extracts visual features, a Transformer decoder generates text sequences, and a pre-trained language model (LM) refines the output to improve accuracy, fluency, and coherence. Our model utilizes a Data-efficient Image Transformer (DeiT), which is adapted for masked image modeling in this work. In addition, we adopt a RoBERTa architecture optimized for masked language modeling (MLM) to enhance the linguistic comprehension and generative capabilities of the proposed model. The Transformer decoder generates text sequences from visual embeddings. The model processes a paragraph image iteratively, line by line, a procedure we call implicit line segmentation. The proposed model was evaluated on our custom datasets 'Parimal Urdu' and 'Parimal Hindi', introduced in this work, as well as two public datasets. It achieved benchmark results on the NUST-UHWR, PUCIT-OUHL, and Parimal-Urdu datasets, with character recognition rates of 96.24%, 92.05%, and 94.80%, respectively, and a character recognition rate of 80.64% on the Hindi dataset. These results indicate that the proposed model outperforms several state-of-the-art Urdu text recognition methods.
https://arxiv.org/abs/2512.01348
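A DeiT encoder can be paired with a RoBERTa decoder using Hugging Face's `VisionEncoderDecoderModel`; the sketch below shows only that general wiring pattern (untrained cross-attention, checkpoint names chosen for illustration), not the authors' released model, masked-modeling pretraining, or LM refinement stage.

```python
from transformers import VisionEncoderDecoderModel, AutoTokenizer, AutoImageProcessor
from PIL import Image

# illustrative checkpoints; the paper additionally pretrains with masked image /
# language modeling and refines outputs with an LM, none of which is shown here
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/deit-base-distilled-patch16-224", "roberta-base"
)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")

model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

line_image = Image.open("handwritten_line.png").convert("RGB")   # hypothetical input
pixel_values = image_processor(line_image, return_tensors="pt").pixel_values
generated = model.generate(pixel_values, max_new_tokens=64)       # requires fine-tuning to be useful
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```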
The point spread function (PSF) serves as a fundamental descriptor linking the real-world scene to the captured signal, manifesting as camera blur. Accurate PSF estimation is crucial for both optical characterization and computational vision, yet remains challenging due to the inherent ambiguity and the ill-posed nature of intensity-based deconvolution. We introduce CircleFlow, a high-fidelity PSF estimation framework that employs flow-guided edge localization for precise blur characterization. CircleFlow begins with a structured capture that encodes locally anisotropic and spatially varying PSFs by imaging a circle grid target, while leveraging the target's binary luminance prior to decouple image and kernel estimation. The latent sharp image is then reconstructed through subpixel alignment of an initialized binary structure guided by optical flow, whereas the PSF is modeled as an energy-constrained implicit neural representation. Both components are jointly optimized within a demosaicing-aware differentiable framework, ensuring physically consistent and robust PSF estimation enabled by accurate edge localization. Extensive experiments on simulated and real-world data demonstrate that CircleFlow achieves state-of-the-art accuracy and reliability, validating its effectiveness for practical PSF calibration.
https://arxiv.org/abs/2512.00796
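The "energy-constrained implicit neural representation" of the PSF can be pictured as a small coordinate MLP whose output is normalized to sum to one. The sketch below is a toy version of that idea under our own parameterization, not CircleFlow's actual model.

```python
import torch
import torch.nn as nn

class ImplicitPSF(nn.Module):
    """Toy energy-constrained implicit PSF (illustrative, not CircleFlow's exact
    parameterization): a small MLP maps kernel coordinates to logits, and a
    softmax over all taps enforces a non-negative kernel that sums to one."""
    def __init__(self, kernel_size: int = 25, hidden: int = 64):
        super().__init__()
        self.kernel_size = kernel_size
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, kernel_size),
            torch.linspace(-1, 1, kernel_size),
            indexing="ij",
        )
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1).reshape(-1, 2))
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self):
        logits = self.mlp(self.coords).squeeze(-1)       # (k*k,)
        kernel = torch.softmax(logits, dim=0)            # energy constraint: sums to 1
        return kernel.reshape(self.kernel_size, self.kernel_size)

psf = ImplicitPSF()()
print(psf.shape, float(psf.sum()))   # torch.Size([25, 25]) ~1.0
```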
Business Process Model and Notation (BPMN) is a widely adopted standard for representing complex business workflows. While BPMN diagrams are often exchanged as visual images, existing methods primarily rely on XML representations for computational analysis. In this work, we present a pipeline that leverages Vision-Language Models (VLMs) to extract structured JSON representations of BPMN diagrams directly from images, without requiring source model files or textual annotations. We also incorporate optical character recognition (OCR) for textual enrichment and evaluate the generated element lists against ground truth data derived from the source XML files. Our approach enables robust component extraction in scenarios where original source files are unavailable. We benchmark multiple VLMs and observe performance improvements in several models when OCR is used for text enrichment. In addition, we conducted extensive statistical analyses of OCR-based enrichment methods and prompt ablation studies, providing a clearer understanding of their impact on model performance.
https://arxiv.org/abs/2511.22448
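The overall pipeline (OCR text enriching a VLM prompt that returns structured JSON) can be sketched as below; `vlm_query(image, prompt) -> str` is a hypothetical wrapper around any VLM, and the prompt and JSON schema are illustrative rather than the paper's exact design.

```python
import json
import pytesseract
from PIL import Image

def extract_bpmn_elements(image_path, vlm_query):
    """Illustrative pipeline in the spirit of the paper: OCR text enriches the
    prompt, and a VLM returns a structured JSON list of BPMN elements."""
    image = Image.open(image_path)
    ocr_text = pytesseract.image_to_string(image)         # textual enrichment

    prompt = (
        "You are given a BPMN diagram. List every element as a JSON object with "
        "fields 'type' (task, gateway, event, ...), 'label', and 'connects_to'. "
        f"OCR text extracted from the image (may contain errors):\n{ocr_text}\n"
        "Return only a JSON array."
    )
    raw = vlm_query(image, prompt)
    return json.loads(raw)                                 # structured element list
```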
Large vision-language models (VLMs) are increasingly deployed for optical character recognition (OCR) in healthcare settings, raising critical concerns about protected health information (PHI) exposure during document processing. This work presents the first systematic evaluation of inference-time vision token masking as a privacy-preserving mechanism for medical document OCR using DeepSeek-OCR. We introduce seven masking strategies (V3-V9) targeting different architectural layers (SAM encoder blocks, compression layers, dual vision encoders, projector fusion) and evaluate PHI reduction across HIPAA-defined categories using 100 synthetic medical billing statements (drawn from a corpus of 38,517 annotated documents) with perfect ground-truth annotations. All masking strategies converge to 42.9% PHI reduction, successfully suppressing long-form spatially-distributed identifiers (patient names, dates of birth, physical addresses at 100% effectiveness) while failing to prevent short structured identifiers (medical record numbers, social security numbers, email addresses, account numbers at 0% effectiveness). Ablation studies varying mask expansion radius (r=1,2,3) demonstrate that increased spatial coverage does not improve reduction beyond this ceiling, indicating that language model contextual inference - not insufficient visual masking - drives structured identifier leakage. A simulated hybrid architecture combining vision masking with NLP post-processing achieves 88.6% total PHI reduction (assuming 80% NLP accuracy on remaining identifiers). This negative result establishes boundaries for vision-only privacy interventions in VLMs, provides guidance distinguishing PHI types amenable to vision-level versus language-level redaction, and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures for HIPAA-compliant medical document processing.
https://arxiv.org/abs/2511.18272
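A generic, patch-level version of inference-time vision token masking is sketched below: PHI bounding boxes are mapped onto the vision encoder's patch grid and the corresponding token embeddings are zeroed. The paper applies masking at specific architectural layers of DeepSeek-OCR (SAM blocks, compression layers, projector fusion); this sketch only illustrates the general mechanism.

```python
import torch

def mask_vision_tokens(vision_tokens, phi_boxes, image_size, grid_size):
    """Illustrative inference-time masking (generic, not DeepSeek-OCR's internals):
    zero out the vision-token embeddings whose patch locations overlap detected
    PHI bounding boxes, so the language decoder never attends to them."""
    B, N, D = vision_tokens.shape                 # N = grid_size * grid_size patches
    W, H = image_size
    patch_w, patch_h = W / grid_size, H / grid_size

    keep = torch.ones(N, dtype=torch.bool)
    for (x1, y1, x2, y2) in phi_boxes:            # PHI regions in pixel coordinates
        c1, c2 = int(x1 // patch_w), int(x2 // patch_w)
        r1, r2 = int(y1 // patch_h), int(y2 // patch_h)
        for r in range(max(r1, 0), min(r2, grid_size - 1) + 1):
            for c in range(max(c1, 0), min(c2, grid_size - 1) + 1):
                keep[r * grid_size + c] = False

    return vision_tokens * keep.view(1, N, 1).to(vision_tokens.dtype)

tokens = torch.randn(1, 16 * 16, 1024)
masked = mask_vision_tokens(tokens, phi_boxes=[(0, 0, 200, 40)],
                            image_size=(1024, 1024), grid_size=16)
```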
Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Codes and model weights are publicly accessible at this https URL
https://arxiv.org/abs/2511.15244
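One way to picture the first compression stage is a small model that attends jointly over the long context and a fixed number of learnable latent slots, keeping only the latent slots' hidden states. The toy sketch below uses generic Transformer blocks as stand-ins for the actual LLMs and is our own simplification, not the paper's architecture or training setup.

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Toy stand-in for the small first-stage LLM: it reads the long context plus
    a fixed number of learnable latent slots and returns the hidden states of
    those slots as the compressed representation (illustrative only)."""
    def __init__(self, vocab_size=32000, d_model=512, n_latents=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, token_ids):                       # (B, T_long)
        x = self.embed(token_ids)                       # (B, T_long, D)
        lat = self.latents.expand(x.size(0), -1, -1)    # (B, N, D)
        h = self.encoder(torch.cat([x, lat], dim=1))    # joint attention over text + slots
        return h[:, -lat.size(1):, :]                   # (B, N, D) latent tokens

# The second-stage (large) LLM would receive these latent vectors as soft-prompt
# embeddings (e.g. via `inputs_embeds` in Hugging Face models) and be trained to
# decode / answer over the original context.
compressor = LatentCompressor()
dummy_ids = torch.randint(0, 32000, (1, 640))           # 640 text tokens
latents = compressor(dummy_ids)                          # 32 latent tokens -> 20x compression
print(latents.shape)                                     # torch.Size([1, 32, 512])
```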
Optical Character Recognition (OCR) for data extraction from documents is essential to intelligent informatics, such as digitizing medical records and recognizing road signs. Multi-modal Large Language Models (LLMs) can solve this task and have shown remarkable performance. Recently, it has been noticed that the accuracy of data extraction by multi-modal LLMs can be affected when in-plane rotations are present in the documents. However, real-world document images are usually not only in-plane rotated but also perspectively distorted. This study investigates the impacts of such perturbations on the data extraction accuracy of the state-of-the-art model Gemini-1.5-pro. Because perspective distortions have many degrees of freedom, designing experiments in the same manner as for single-parameter rotations is difficult. We observed typical distortions of document images and showed that most of them approximately follow an isosceles-trapezoidal transformation, which allows us to evaluate distortions with a small number of parameters. We were able to reduce the number of independent parameters from eight to two, i.e., rotation angle and distortion ratio. Then, specific entities were extracted from synthetically generated sample documents while varying these parameters. To assess LLM performance, we evaluated not only character-recognition accuracy but also structure-recognition accuracy. Whereas the former is the classical indicator for optical character recognition, the latter relates to the correctness of reading order. In particular, the structure-recognition accuracy was found to be significantly degraded by document distortion. In addition, we found that this accuracy can be improved by a simple rotational correction. This insight will contribute to the practical use of multi-modal LLMs for OCR tasks.
https://arxiv.org/abs/2511.17607
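A two-parameter distortion of this kind is easy to synthesize with OpenCV: map the page onto an isosceles trapezoid controlled by a distortion ratio, then apply an in-plane rotation. The exact parameterization below is ours, chosen to illustrate the idea, not necessarily the one used in the paper's experiments.

```python
import cv2
import numpy as np

def trapezoid_distort(img, distortion_ratio=0.8, angle_deg=10.0):
    """Illustrative two-parameter document distortion: map the page to an
    isosceles trapezoid whose top edge is `distortion_ratio` times the bottom
    edge, then rotate in-plane by `angle_deg`."""
    h, w = img.shape[:2]
    shrink = (1.0 - distortion_ratio) * w / 2.0
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = np.float32([[shrink, 0], [w - shrink, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(img, M, (w, h), borderValue=(255, 255, 255))

    R = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(warped, R, (w, h), borderValue=(255, 255, 255))
```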
We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.
https://arxiv.org/abs/2511.14210
The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.
https://arxiv.org/abs/2511.12255
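Keyframe extraction with ffmpeg, as mentioned above, typically amounts to keeping only intra-coded (I) frames; a common invocation is sketched below via Python's subprocess module. The exact ffmpeg command used by Fusionista2.0 is not specified in the abstract, so this is only a representative example.

```python
import subprocess

def extract_keyframes(video_path, out_dir):
    """Illustrative ffmpeg keyframe extraction: keep only I-frames (keyframes)
    and write them out as numbered JPEGs."""
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", "select='eq(pict_type,I)'",   # keep only intra-coded frames
            "-vsync", "vfr",                     # variable frame rate output
            f"{out_dir}/keyframe_%05d.jpg",
        ],
        check=True,
    )
```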
Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which required the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) text and seal detection and (2) document binarization. For the text and seal detection track, we provide baseline results using multiple versions of the You Only Look Once (YOLO) models for detecting Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, and Generative Adversarial Network (GAN)-based methods. The DKDS dataset and the implementation code for baseline methods are available at this https URL.
https://arxiv.org/abs/2511.09117
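For the binarization track, the flavour of the "traditional" and "traditional plus K-means" baselines can be sketched as below with OpenCV; these are generic implementations of the named algorithm families, not the exact DKDS baseline code.

```python
import cv2
import numpy as np

def binarize_otsu(gray):
    """Classical Otsu-thresholding baseline on a grayscale uint8 image."""
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

def binarize_kmeans(gray, k=3):
    """Illustrative traditional-plus-K-means baseline: cluster pixel intensities
    into k groups and keep the darkest cluster as ink."""
    pixels = gray.reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    ink_cluster = int(np.argmin(centers))               # darkest cluster = text / seal ink
    mask = labels.reshape(gray.shape) == ink_cluster
    return np.where(mask, 0, 255).astype(np.uint8)
```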
Reliable geospatial information on road accidents is vital for safety analysis and infrastructure planning, yet most low- and middle-income countries continue to face a critical shortage of accurate, location-specific crash data. Existing text-based geocoding tools perform poorly in multilingual and unstructured news environments, where incomplete place descriptions and mixed Bangla-English scripts obscure spatial context. To address these limitations, this study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning)- a vision-language framework that emulates human spatial reasoning to infer accident coordinates directly from textual and map-based cues. ALIGN integrates large language and vision-language models within a multi-stage pipeline that performs optical character recognition, linguistic reasoning, and map-level verification through grid-based spatial scanning. The framework systematically evaluates each predicted location against contextual and visual evidence, ensuring interpretable, fine-grained geolocation outcomes without requiring model retraining. Applied to Bangla-language news data, ALIGN demonstrates consistent improvements over traditional geoparsing methods, accurately identifying district and sub-district-level crash sites. Beyond its technical contribution, the framework establishes a high-accuracy foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the broader integration of multimodal artificial intelligence in transportation analytics. The code for this paper is open-source and available at: this https URL
https://arxiv.org/abs/2511.06316
This research explores the fusion of graphology and artificial intelligence to quantify psychological stress levels in students by analyzing their handwritten examination scripts. By leveraging Optical Character Recognition and transformer-based sentiment analysis models, we present a data-driven approach that transcends traditional grading systems, offering deeper insights into cognitive and emotional states during examinations. The system integrates high-resolution image processing, TrOCR, and sentiment-entropy fusion using RoBERTa-based models to generate a numerical Stress Index. Our method achieves robustness through a five-model voting mechanism and unsupervised anomaly detection, making it an innovative framework in academic forensics.
https://arxiv.org/abs/2511.11633
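The abstract does not give the Stress Index formula, so the sketch below only illustrates one plausible reading of "sentiment-entropy fusion with five-model voting": compute the normalized entropy of each model's sentiment distribution and average across models.

```python
import numpy as np

def stress_index(model_probs):
    """Illustrative fusion only (the paper's exact formula is not stated in the
    abstract): each row of `model_probs` is one sentiment model's probability
    distribution (e.g. negative/neutral/positive) for a script; the five-model
    vote is approximated here by averaging normalized entropies."""
    scores = []
    for p in model_probs:
        p = np.asarray(p, dtype=float)
        p = p / p.sum()
        entropy = -np.sum(p * np.log(p + 1e-12))
        scores.append(entropy / np.log(len(p)))     # normalize to [0, 1]
    return float(np.mean(scores))

# five hypothetical RoBERTa-style sentiment heads voting on one exam script
print(stress_index([[0.7, 0.2, 0.1]] * 5))
```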
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, namely Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
https://arxiv.org/abs/2511.04727
Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy of 89.54% with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.
https://arxiv.org/abs/2511.04560
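The "dynamically selects between retrieval and reasoning" step can be pictured as a simple router in front of the generator. The sketch below is our own schematic of that pattern; `llm` and `retrieve` are hypothetical wrappers around the generator and the OCR'd textbook/web index, and the prompts are not the paper's templates.

```python
def agentic_rag(question, options, llm, retrieve):
    """Illustrative agentic routing (not the paper's exact pipeline):
    `llm(prompt) -> str` and `retrieve(query, k) -> list[str]` are hypothetical."""
    # Step 1: let the model decide whether external evidence is needed
    decision = llm(
        f"Question (Bangla medical MCQ): {question}\nOptions: {options}\n"
        "Answer RETRIEVE if you need textbook evidence, or DIRECT if you can "
        "answer reliably from your own knowledge. Reply with one word."
    ).strip().upper()

    # Step 2: retrieval-augmented or direct answering
    if decision.startswith("RETRIEVE"):
        passages = retrieve(question, k=5)
        context = "\n\n".join(passages)
        prompt = (f"Context:\n{context}\n\nQuestion: {question}\nOptions: {options}\n"
                  "Choose the correct option and give a short rationale.")
    else:
        prompt = (f"Question: {question}\nOptions: {options}\n"
                  "Choose the correct option and give a short rationale.")
    return llm(prompt)
```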
Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in real-world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR), where misalignment commonly arises from user error, particularly an incorrect base orientation of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 mid- to low-resource Indic languages. We also present a fast, robust, and lightweight rotation classification pipeline built on the vision encoder of the Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for the 4-class rotation task in a standalone fashion. Our method achieves near-perfect accuracy in identifying rotations: 96% and 92% on the two datasets, respectively. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance for closed-source models (up to 14%) and open-weights models (up to 4x) in a simulated real-world setting.
https://arxiv.org/abs/2511.04161
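Plugging such a 4-class rotation classifier in front of OCR is straightforward; a minimal sketch is shown below, where `classifier(image) -> logits` stands in for the fine-tuned vision-encoder head described above, and the sign convention for undoing the rotation depends on how the classes were defined.

```python
import torch
from PIL import Image

ROTATION_CLASSES = [0, 90, 180, 270]   # degrees, the 4-class setup from the abstract

@torch.no_grad()
def correct_rotation(image: Image.Image, classifier) -> Image.Image:
    """Illustrative use of a 4-class rotation classifier before OCR;
    `classifier(image) -> logits over the four classes` is a hypothetical wrapper."""
    logits = classifier(image)
    predicted = ROTATION_CLASSES[int(logits.argmax())]
    # undo the predicted rotation so downstream OCR sees upright text
    # (PIL rotates counter-clockwise; flip the sign if classes are defined the other way)
    return image.rotate(-predicted, expand=True)
```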
The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed- and open-sourced LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are clearly readable with sufficient contrast and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
https://arxiv.org/abs/2511.02014