Word length-aware text spotting: Enhancing detection and recognition in dense text image

2023-12-25 10:46:20
Hao Wang, Huabing Zhou, Yanduo Zhang, Tao Lu, Jiayi Ma

Abstract

Scene text spotting is essential in various computer vision applications, enabling the extraction and interpretation of textual information from images. However, existing methods often neglect the spatial semantics of word images, leading to suboptimal detection recall for long and short words within the long-tailed word length distributions that are prominent in dense scenes. In this paper, we present WordLenSpotter, a novel word length-aware spotter for scene text image detection and recognition, which improves the spotting of long and short words, particularly in the tail data of dense text images. We first design an image encoder equipped with a dilated convolutional fusion module to integrate multiscale text image features effectively. Then, leveraging the Transformer framework, we jointly optimize text detection and recognition accuracy after iteratively refining text region image features using the word length prior. Specifically, we design a Spatial Length Predictor (SLP) module that uses a character count prior tailored to different word lengths to constrain the regions of interest effectively. Furthermore, we introduce a specialized word Length-aware Segmentation (LenSeg) proposal head, enhancing the network's capacity to capture the distinctive features of long and short words within categories characterized by long-tailed distributions. Comprehensive experiments on public datasets and our dense text spotting dataset DSTD1500 demonstrate the superiority of our proposed methods, particularly in dense text image detection and recognition tasks involving long-tailed word length distributions that span both long and short words.
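
The abstract mentions a dilated convolutional fusion module for integrating multiscale features but gives no implementation details. As a rough illustration of the general idea only, the sketch below applies the same 1D kernel at several dilation rates and averages the branches. The function names (`dilated_conv1d`, `fuse_multiscale`, `receptive_field`), the dilation rates, and the averaging rule are all assumptions made for this example, not the WordLenSpotter design, which operates on 2D image features.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Zero-padded 1D dilated cross-correlation; output has the input's length."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            # Kernel taps are spaced `dilation` samples apart.
            idx = i + (j - center) * dilation
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out


def fuse_multiscale(signal: Sequence[float], kernel: Sequence[float],
                    dilations: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Average the branch outputs (a hypothetical fusion rule chosen for illustration)."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(values) / len(branches) for values in zip(*branches)]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Span of input samples covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


if __name__ == "__main__":
    impulse = [0, 0, 0, 1, 0, 0, 0]
    box = [1, 1, 1]
    for d in (1, 2, 4):
        print(d, receptive_field(len(box), d), dilated_conv1d(impulse, box, d))
    print(fuse_multiscale(impulse, box, dilations=(1, 2)))
```

A 3-tap kernel covers 3, 5, and 9 input samples at dilation rates 1, 2, and 4, so combining the branches mixes narrow and wide context without adding kernel weights. That property is the usual motivation for dilated fusion when targets vary widely in extent, as short and long words do.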

URL

https://arxiv.org/abs/2312.15690

PDF

https://arxiv.org/pdf/2312.15690.pdf

