Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

2026-03-31 19:09:55
Longwei Xu, Feng Feng, Shaojie Zhang, Xin Chen, Hang Li, Anan Du, Hailong Yu, Pei Fu, Zhenbo Luo, Jian Luan

Abstract

Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles where the text is from what the text is, enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training. Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.
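
To make the notion of a text anchor concrete, here is a minimal sketch. It is not the paper's implementation. It assumes an anchor is a binary mask over image pixels and scores a predicted mask against a reference mask with intersection-over-union (IoU), a common grounding metric; the abstract does not say which metric TABench actually uses. The names `Mask`, `mask_iou`, and `mask_to_box` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a mask is a list of rows of 0/1 values.
Mask = List[List[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks are treated as a perfect match (an assumption).
    return inter / union if union else 1.0


def mask_to_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Tight bounding box (x_min, y_min, x_max, y_max), or None if the mask is empty."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 3x4 example: the predicted anchor is shifted one column to the right.
gt = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

iou = mask_iou(pred, gt)   # 2 shared pixels out of 6 covered pixels
box = mask_to_box(pred)    # region that a recognizer could then read
```

In this toy case the masks overlap on two of the six covered pixels, so the IoU is 1/3. The box derived from the predicted mask illustrates the "where before what" idea: the region is fixed first, and recognition would then be restricted to it.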

URL

https://arxiv.org/abs/2604.00161

PDF

https://arxiv.org/pdf/2604.00161.pdf

