Paper Reading AI Learner

Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes

2018-07-09 09:58:06
Fangneng Zhan, Shijian Lu, Chuhui Xue

Abstract

The requirement of large amounts of annotated images has become a grand challenge in training deep neural network models for various visual detection and recognition tasks. This paper presents a novel image synthesis technique that aims to generate large amounts of annotated scene text images for training accurate and robust scene text detection and recognition models. The proposed technique consists of three innovative designs. First, it realizes "semantically coherent" synthesis by embedding texts at semantically sensible regions within the background image, where semantic coherence is achieved by leveraging the semantic annotations of objects and image regions created in prior semantic segmentation research. Second, it exploits visual saliency to determine the embedding locations within each semantically sensible region, which accords with the fact that texts are often placed around homogeneous regions in scenes for better visibility. Third, it designs an adaptive text appearance model that determines the color and brightness of embedded texts by learning adaptively from the features of real scene text images. The proposed technique has been evaluated over five public datasets, and the experiments show its superior performance in training accurate and robust scene text detection and recognition models.


URL

https://arxiv.org/abs/1807.03021

PDF

https://arxiv.org/pdf/1807.03021.pdf

