Abstract
Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. We propose an end-to-end trainable neural network model for scene text spotting. The proposed model, named Mask TextSpotter, is inspired by the recently published Mask R-CNN. Unlike previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter benefits from a simple and smooth end-to-end learning procedure, in which precise text detection and recognition are achieved via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, such as curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.
URL
https://arxiv.org/abs/1807.02242