Paper Reading AI Learner

CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition

2024-01-18 15:05:57
Jinzhi Zheng, Ruyi Ji, Libo Zhang, Yanjun Wu, Chen Zhao

Abstract

Scene text recognition, as a cross-modal task involving vision and text, is an important research topic in computer vision. Most existing methods use language models to extract semantic information for optimizing visual recognition. However, the guidance of visual cues is ignored during semantic mining, which limits performance on irregular scene text. To tackle this issue, we propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition, which incorporates visual cues into the semantic mining process. Specifically, CMFN consists of a position self-enhanced encoder, a visual recognition branch, and an iterative semantic recognition branch. The position self-enhanced encoder provides character-sequence position encoding to both the visual recognition branch and the iterative semantic recognition branch. The visual recognition branch performs recognition based on the visual features extracted by a CNN and the position encoding supplied by the position self-enhanced encoder. The iterative semantic recognition branch, which consists of a language recognition module and a cross-modal fusion gate, simulates the way humans recognize scene text and integrates cross-modal visual cues for text recognition. Experiments demonstrate that the proposed CMFN achieves performance comparable to state-of-the-art algorithms, indicating its effectiveness.
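
The abstract does not spell out how the cross-modal fusion gate combines the two modalities, so the sketch below is only an illustration: the module name, feature shapes, and sigmoid gating are assumptions rather than details taken from the paper. It shows one common way to gate between visual features and semantic (language) features aligned per character position.

```python
# Hypothetical sketch of a cross-modal fusion gate (not the authors' code).
# It mixes visual and semantic character features with a learned gate,
# which is one plausible realization of the fusion described in the abstract.
import torch
import torch.nn as nn


class CrossModalFusionGate(nn.Module):
    """Fuse visual and semantic character features with a learned gate."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate sees both modalities and outputs per-channel weights in (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # visual, semantic: (batch, seq_len, dim), aligned by character position.
        g = self.gate(torch.cat([visual, semantic], dim=-1))
        return g * visual + (1.0 - g) * semantic


if __name__ == "__main__":
    fuse = CrossModalFusionGate(dim=256)
    v = torch.randn(2, 25, 256)   # visual features for 25 character slots
    s = torch.randn(2, 25, 256)   # semantic features from the language module
    print(fuse(v, s).shape)       # torch.Size([2, 25, 256])
```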

URL

https://arxiv.org/abs/2401.10041

PDF

https://arxiv.org/pdf/2401.10041.pdf

