Abstract
The requirement of large amounts of annotated images has become a grand challenge in training deep neural network models for various visual detection and recognition tasks. This paper presents a novel image synthesis technique that aims to generate a large amount of annotated scene text images for training accurate and robust scene text detection and recognition models. The proposed technique consists of three innovative designs. First, it realizes "semantic coherent" synthesis by embedding texts at semantically sensible regions within the background image, where the semantic coherence is achieved by leveraging the semantic annotations of objects and image regions created in prior semantic segmentation research. Second, it exploits visual saliency to determine the embedding locations within each semantically sensible region, which coincides with the fact that texts are often placed around homogeneous regions for better visibility in scenes. Third, it designs an adaptive text appearance model that determines the color and brightness of embedded texts by learning adaptively from the features of real scene text images. The proposed technique has been evaluated over five public datasets, and the experiments show its superior performance in training accurate and robust scene text detection and recognition models.
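The saliency-guided placement idea above can be illustrated with a minimal sketch (not the paper's implementation): treat low local variance as a proxy for a homogeneous, low-saliency region, and pick the window with the lowest variance as the text-embedding location. All function names and the toy image below are illustrative assumptions.

```python
# Minimal sketch, assuming local variance as a stand-in for visual saliency:
# text is embedded where the image is most homogeneous (lowest variance).

def window_variance(img, r, c, h, w):
    """Variance of pixel values in the h-by-w window at (r, c)."""
    vals = [img[i][j] for i in range(r, r + h) for j in range(c, c + w)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def best_embedding_window(img, h, w):
    """Return (row, col) of the h-by-w window with the lowest variance."""
    rows, cols = len(img), len(img[0])
    best, best_var = None, float("inf")
    for r in range(rows - h + 1):
        for c in range(cols - w + 1):
            v = window_variance(img, r, c, h, w)
            if v < best_var:
                best_var, best = v, (r, c)
    return best

# Toy 4x6 grayscale "image": flat region on the left, noisy on the right.
toy = [
    [10, 10, 10, 80, 200, 40],
    [10, 10, 10, 5, 120, 220],
    [10, 10, 10, 90, 30, 160],
    [10, 10, 10, 210, 70, 15],
]
print(best_embedding_window(toy, 2, 3))  # -> (0, 0), the flat left block
```

The paper's actual pipeline additionally restricts candidate windows to semantically sensible regions and adapts text color and brightness to the local background; this sketch covers only the homogeneity criterion.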
URL
https://arxiv.org/abs/1807.03021