Abstract
Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurate grounding of queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought (CoT) reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles "where the text is" from "what the text is", enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training. Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.
URL
https://arxiv.org/abs/2604.00161