On Manipulating Scene Text in the Wild with Diffusion Models

2023-11-01 11:31:50
Joshua Santoso, Christian Simon, Williem Pao

Abstract

Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, images generated by stable diffusion models suffer from deteriorated details, a pitfall that impacts image editing tasks requiring information preservation, e.g., scene text editing. As a desired result, the model must be able to replace the text in the source image with the target text while preserving details such as color, font size, and background. To leverage the potential of diffusion models, in this work we introduce a Diffusion-BasEd Scene Text manipulation network, called DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-art methods on various scene text datasets, then provide extensive ablation studies at each granularity to analyze our performance gains. We also demonstrate the effectiveness of our proposed method for synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy: our method achieves 94.15% and 98.12% character-level accuracy on the COCO-Text and ICDAR2013 datasets, respectively.
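The abstract names "text-recognition guidance" as one of the two adaptation strategies. A common way to realize this kind of guidance is classifier-style steering: at each denoising step, the gradient of an OCR model's recognition loss on the intermediate sample nudges generation toward the target text. The sketch below illustrates that idea only, under stated assumptions; `unet`, `ocr_model`, `target_ids`, and the diffusers-style `scheduler` are hypothetical stand-ins, not the authors' released code.

```python
import torch

@torch.no_grad()
def guided_step(unet, scheduler, ocr_model, x_t, t, target_ids, scale=1.0):
    """One denoising step with text-recognition guidance (illustrative)."""
    # Ordinary reverse-diffusion step: predict noise, then let the scheduler
    # (assumed to expose a diffusers-style step()) produce x_{t-1}.
    eps = unet(x_t, t)
    x_prev = scheduler.step(eps, t, x_t).prev_sample

    # Recognition guidance: gradient of the OCR loss w.r.t. the current
    # sample, pushing it toward the target transcription `target_ids`.
    with torch.enable_grad():
        x = x_prev.detach().requires_grad_(True)
        logits = ocr_model(x)  # hypothetical recognizer, shape (B, T, vocab)
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), target_ids.flatten())
        grad = torch.autograd.grad(loss, x)[0]
    return x_prev - scale * grad
```

The reported character-level OCR accuracy is typically computed as one minus the normalized edit distance between the recognized string and the ground-truth target. The paper's exact evaluation protocol may differ; this is a minimal reference implementation of that common definition:

```python
def char_accuracy(pred: str, target: str) -> float:
    """1 - normalized Levenshtein distance, clamped to [0, 1]."""
    m, n = len(pred), len(target)
    dp = list(range(n + 1))  # distance of "" vs target[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # delete pred[i-1]
                        dp[j - 1] + 1,  # insert target[j-1]
                        prev + (pred[i - 1] != target[j - 1]))
            prev = cur
    return max(0.0, 1.0 - dp[n] / max(n, 1))
```

For example, `char_accuracy("STOP", "SHOP")` returns 0.75: one substitution over a four-character target.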

URL

https://arxiv.org/abs/2311.00734

PDF

https://arxiv.org/pdf/2311.00734.pdf

