Abstract
Scene text spotting is essential in various computer vision applications, enabling the extraction and interpretation of textual information from images. However, existing methods often neglect the spatial semantics of word images, leading to suboptimal detection recall for the long and short words in the long-tailed word length distributions that are prominent in dense scenes. In this paper, we present WordLenSpotter, a novel word length-aware spotter for scene text image detection and recognition that improves spotting of long and short words, particularly in the tail data of dense text images. We first design an image encoder equipped with a dilated convolutional fusion module to integrate multiscale text image features effectively. Then, leveraging the Transformer framework, we jointly optimize text detection and recognition accuracy after iteratively refining text region image features using the word length prior. Specifically, we design a Spatial Length Predictor (SLP) module that uses a character count prior tailored to different word lengths to constrain the regions of interest effectively. Furthermore, we introduce a specialized word Length-aware Segmentation (LenSeg) proposal head, enhancing the network's capacity to capture the distinctive features of long and short words within categories characterized by long-tailed distributions. Comprehensive experiments on public datasets and our dense text spotting dataset DSTD1500 demonstrate the superiority of our proposed method, particularly in dense text image detection and recognition tasks involving long-tailed word length distributions that span both long and short words.
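To make the notion of a long-tailed word length distribution concrete, the following is a minimal sketch, not the method described above. It counts words by character length and flags the lengths that account for only a small share of all words. The function names, the `min_share` threshold, and the toy word list are all assumptions introduced for illustration; none of them come from WordLenSpotter or DSTD1500.

```python
from collections import Counter


def word_length_histogram(words):
    """Count how many words have each character length."""
    return Counter(len(w) for w in words)


def tail_lengths(histogram, min_share=0.1):
    """Return the sorted lengths whose share of all words is below min_share.

    min_share is a hypothetical cutoff chosen for this example only.
    """
    total = sum(histogram.values())
    return sorted(n for n, c in histogram.items() if c / total < min_share)


# Toy transcriptions (illustrative only, not drawn from any dataset).
words = ["a", "sale", "open", "shop", "exit", "menu", "cafe", "taxi",
         "hotel", "pizza", "store", "international"]

hist = word_length_histogram(words)
tail = tail_lengths(hist)
print(dict(hist))  # counts per character length
print(tail)        # lengths that fall in the tail under this cutoff
```

In this toy sample the mid-length words dominate, while the single very short word and the single very long word land in the tail. These are the cases the abstract identifies as having poor detection recall.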
Abstract (translated)
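Scene text detection is essential in various computer vision applications, as it extracts and interprets textual information from images. However, existing methods usually neglect the spatial semantics of word images, so detection recall tends to be low for the long and short words that follow a pronounced long-tailed distribution. In this paper, we propose WordLenSpotter, a novel word length-based scene text detector that pays particular attention to the tail data of long and short words in dense scenes. We first design an image encoder with a dilated convolutional fusion module to integrate multiscale text image features effectively. Then, using the Transformer framework, we iteratively refine text region image features, thereby jointly improving text detection and recognition accuracy. In particular, we design a Spatial Length Predictor (SLP) module tailored to different word lengths, which effectively constrains the regions of interest. In addition, we introduce a dedicated Length-aware Segmentation (LenSeg) proposal head, which strengthens the network's ability to capture features of categories with long-tailed distributions. Comprehensive experiments on public datasets and our dense text detection dataset DSTD1500 demonstrate the superiority of the proposed method, especially in complex scene text detection and recognition tasks involving long and short words.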
URL
https://arxiv.org/abs/2312.15690