Abstract
Scene text detection, an essential step in scene text recognition systems, aims to locate text instances in natural scene images automatically. Some recent attempts building on Mask R-CNN formulate the scene text detection task as an instance segmentation problem and achieve remarkable performance. In this paper, we present a new Mask R-CNN based framework named Pyramid Mask Text Detector (PMTD) to handle scene text detection. Instead of the binary text mask generated by existing Mask R-CNN based methods, PMTD performs pixel-level regression under the guidance of location-aware supervision, yielding a more informative soft text mask for each text instance. To generate text boxes, PMTD reinterprets the obtained 2D soft mask in 3D space and introduces a novel plane clustering algorithm to derive the optimal text box on the basis of the 3D shape. Experiments on standard datasets demonstrate that the proposed PMTD brings consistent and noticeable gains and clearly outperforms state-of-the-art methods. Specifically, it achieves an F-measure of 80.13% on the ICDAR 2017 MLT dataset.
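To make the contrast between a binary mask and a soft mask concrete, the following is a minimal sketch of a pyramid-shaped soft label for a single axis-aligned box. It is an illustration only, not the label generation used by PMTD: the abstract does not give the formula, so the function names `pyramid_mask` and `binarize`, the restriction to an axis-aligned rectangle, and the expression 1 - max(|u|, |v|) are all assumptions chosen for simplicity.

```python
def pyramid_mask(height, width):
    """Return a height x width soft mask shaped like a pyramid (illustrative).

    Pixel centres are mapped to the open interval (-1, 1) along each axis,
    with the box borders at -1 and 1. The value 1 - max(|u|, |v|) approaches
    0 at the border and reaches 1 at the centre, forming four planar faces.
    """
    mask = []
    for row in range(height):
        v = (2 * row + 1) / height - 1  # vertical position of the pixel centre
        line = []
        for col in range(width):
            u = (2 * col + 1) / width - 1  # horizontal position of the pixel centre
            line.append(1.0 - max(abs(u), abs(v)))
        mask.append(line)
    return mask


def binarize(mask, threshold=0.5):
    """Collapse a soft mask to a binary one, discarding the location cue."""
    return [[1 if value >= threshold else 0 for value in line] for line in mask]


if __name__ == "__main__":
    soft = pyramid_mask(5, 9)
    for line in soft:
        print(" ".join(f"{value:.2f}" for value in line))
    print(binarize(soft))
```

In this sketch every pixel value encodes how far the pixel lies from the box border, which is the kind of location information a binary mask cannot carry. Read as a height map, the soft mask forms a 3D pyramid whose faces are planes, which is the shape the plane clustering step described above operates on.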
URL
https://arxiv.org/abs/1903.11800