Paper Reading AI Learner

Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition

2024-02-21 09:22:45
Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai

Abstract

Scene text recognition is a rapidly developing field that faces numerous challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a novel approach called Class-Aware Mask-guided feature refinement (CAM) to address these challenges. Our approach introduces canonical class-aware glyph masks generated from a standard font to effectively suppress background and text-style noise, thereby enhancing feature discrimination. Additionally, we design a feature alignment and fusion module that incorporates the canonical mask guidance to further refine the features used for text recognition. By enhancing the alignment between the canonical mask features and the text features, the module ensures more effective fusion, ultimately leading to improved recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, CAM outperforms the state-of-the-art method by an average of 4.1% across six more challenging datasets, despite using a smaller model. Our study highlights the importance of incorporating canonical mask guidance and aligned feature refinement for robust scene text recognition. The code is available at this https URL.
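The abstract describes the alignment-and-fusion idea only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of one way such mask-guided refinement could be wired up: text features attend to canonical glyph-mask features, and the aligned result is injected back through a learned gate. The class name `MaskGuidedFusion`, the use of cross-attention for alignment, and the gated residual fusion are assumptions made for illustration; the paper's actual module may be structured differently.

```python
# Hypothetical sketch of mask-guided feature alignment and fusion.
# This is NOT the paper's implementation; the cross-attention alignment
# and gated residual fusion here are illustrative assumptions.
import torch
import torch.nn as nn


class MaskGuidedFusion(nn.Module):
    """Aligns canonical glyph-mask features with text features, then fuses them."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention: text features query the canonical mask features.
        self.align = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Per-channel gate controlling how much mask guidance is injected.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feat: torch.Tensor, mask_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (B, N, C) visual tokens from the recognition backbone.
        # mask_feat: (B, M, C) features of the canonical class-aware glyph masks.
        aligned, _ = self.align(query=text_feat, key=mask_feat, value=mask_feat)
        gate = self.gate(torch.cat([text_feat, aligned], dim=-1))
        return self.norm(text_feat + gate * aligned)


if __name__ == "__main__":
    fuse = MaskGuidedFusion(dim=256)
    text = torch.randn(2, 64, 256)   # e.g. flattened H x W visual tokens
    mask = torch.randn(2, 64, 256)   # mask-branch features of matching shape
    print(fuse(text, mask).shape)    # torch.Size([2, 64, 256])
```

A gated residual is a common choice when one feature stream should modulate rather than replace the other: it lets the recognizer fall back on the raw text features wherever the mask guidance is unreliable.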

Abstract (translated)

Scene text recognition is a rapidly developing field that faces many challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a new method, Class-Aware Mask-guided feature refinement (CAM), to address these challenges. Our method introduces canonical class-aware glyph masks generated from a standard font to effectively suppress background and text-style noise, thereby improving feature discrimination. In addition, we design a feature alignment and fusion module to further refine the features used for text recognition. By strengthening the alignment between the canonical mask features and the text features, this module enables more effective fusion and ultimately better recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, CAM surpasses the state-of-the-art method by an average of 4.1% on six more challenging datasets, despite using a smaller model. Our study highlights the importance of incorporating canonical mask guidance and aligned feature refinement into text recognition. The code is available at this https URL.

URL

https://arxiv.org/abs/2402.13643

PDF

https://arxiv.org/pdf/2402.13643.pdf

