Abstract
Scene text recognition is a rapidly developing field that faces numerous challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a novel approach called Class-Aware Mask-guided feature refinement (CAM) to address these challenges. Our approach introduces canonical class-aware glyph masks generated from a standard font to effectively suppress background and text style noise, thereby enhancing feature discrimination. Additionally, we design a feature alignment and fusion module that incorporates the canonical mask guidance to further refine features for text recognition. By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion, ultimately leading to improved recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, CAM outperforms the state-of-the-art method by an average of 4.1% across six more challenging datasets, despite using a smaller model. Our study highlights the importance of incorporating canonical mask guidance and aligned feature refinement techniques for robust scene text recognition. The code is publicly available.
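To make the idea of a canonical class-aware glyph mask concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical 3x5 bitmap font (`GLYPHS`) as a stand-in for rendering with a real standard font, and it assumes the mask is encoded as one integer class label per pixel with 0 as background; the abstract does not specify either detail.

```python
# Hypothetical 3x5 bitmap "font": a stand-in for rendering with a real standard font.
GLYPHS = {
    "C": ["111", "100", "100", "100", "111"],
    "A": ["010", "101", "111", "101", "101"],
    "M": ["101", "111", "111", "101", "101"],
}
GLYPH_HEIGHT = 5

# Assumed encoding: class ids start at 1, and 0 is reserved for background.
CHARSET = sorted(GLYPHS)  # ['A', 'C', 'M']
CLASS_ID = {ch: i + 1 for i, ch in enumerate(CHARSET)}


def class_aware_mask(text, gap=1):
    """Render text as a 2D list in which each glyph pixel holds its character's class id."""
    rows = [[] for _ in range(GLYPH_HEIGHT)]
    for k, ch in enumerate(text):
        if k > 0:
            # Background columns between neighbouring glyphs.
            for row in rows:
                row.extend([0] * gap)
        cid = CLASS_ID[ch]
        for row, line in zip(rows, GLYPHS[ch]):
            row.extend(cid if px == "1" else 0 for px in line)
    return rows


mask = class_aware_mask("CAM")
for row in mask:
    # Show background as '.' and glyph pixels as their class id.
    print("".join(str(v) if v else "." for v in row))
```

Because every glyph comes from the same font and carries no background, a mask like this is free of the style and background variation present in the scene image, which is the property the abstract relies on for guidance.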
URL
https://arxiv.org/abs/2402.13643