Research on Multilingual Natural Scene Text Detection Algorithm


Abstract

Natural scene text detection is a significant challenge in computer vision, with broad potential applications in multilingual, diverse, and complex text scenarios. We propose a multilingual text detection model to address the low accuracy and high difficulty of detecting multilingual text in natural scenes. Because multilingual text images contain multiple character sets and varied font styles, we introduce the SFM Swin Transformer feature extraction network to make the model more robust to characters and fonts across languages. Because text in natural scene images varies considerably in scale and is often arranged in complex layouts, we present the AS-HRFPN feature fusion network, which incorporates an Adaptive Spatial Feature Fusion module and a Spatial Pyramid Pooling module; these changes improve the model's ability to detect text of different sizes and orientations. Existing methods also struggle with the diverse backgrounds and font variations in multilingual scene text images because their limited local receptive fields hinder detection performance. To overcome this, we propose a Global Semantic Segmentation Branch that extracts and preserves global features, supplying the broader context needed for effective text detection. We collected and built a real-world multilingual natural scene text image dataset and conducted comprehensive experiments and analyses. The experimental results show that the proposed algorithm achieves an F-measure of 85.02%, which is 4.71% higher than the baseline model. We also conducted extensive cross-dataset validation on the MSRA-TD500, ICDAR2017MLT, and ICDAR2015 datasets to verify the generality of our approach. The code and dataset are publicly available.
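
The abstract reports results as an F-measure. As a reminder of what that number summarizes, the following is a minimal sketch of how an F-measure is commonly derived from detection counts, assuming the standard harmonic mean of precision and recall. The function names and the example counts are hypothetical and are not taken from the paper; the rule used to match detections to ground-truth text regions (for example an IoU threshold) is also not specified here.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(true_positives: int, num_detections: int, num_ground_truth: int):
    """Return (precision, recall, F-measure) from raw detection counts.

    true_positives: detections matched to a ground-truth text region
    num_detections: all regions the detector predicted
    num_ground_truth: all annotated text regions
    """
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical counts for illustration only, not results from the paper.
precision, recall, f1 = detection_scores(
    true_positives=80, num_detections=100, num_ground_truth=100
)
print(f"precision={precision:.2%} recall={recall:.2%} F-measure={f1:.2%}")
```

Because the F-measure is a harmonic mean, it stays low unless precision and recall are both high, which is why it is a common single-number summary for text detectors.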


URL

https://arxiv.org/abs/2312.11153

PDF

https://arxiv.org/pdf/2312.11153.pdf

