
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

2023-12-19 15:18:40
Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao

Abstract

Recently, diffusion-based image generation methods have been credited with remarkable text-to-image generation capabilities, yet they still struggle to accurately generate multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Given text in any language along with a textual description of a scene, our model outputs a photo-realistic image. The model leverages rendered sketch images as priors, thereby arousing the potential multilingual-generation ability of the pre-trained Stable Diffusion. Based on the observation that the cross-attention map influences object placement in generated images, we propose a localized attention constraint in the cross-attention layer to address the unreasonable positioning of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both the accuracy of text recognition and the naturalness of foreground-background blending.
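
Since the abstract only names the "localized attention constraint" at a high level, the snippet below is a minimal PyTorch sketch of what such a constraint on a cross-attention layer could look like: the prompt tokens that describe the scene text are only allowed to attend to (i.e., place content in) a user-specified image region. The function name `localized_cross_attention`, the tensor shapes, and the way the region mask is built are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a localized cross-attention constraint (NOT the paper's code).
# Idea: suppress the attention scores of the scene-text prompt tokens everywhere
# outside a target region, so the generated glyphs can only appear inside it.
import torch


def localized_cross_attention(q, k, v, text_token_ids, region_mask):
    """Cross-attention whose text-related tokens may only act inside a region.

    q:              (batch, n_pixels, dim)  image-side queries (flattened latent)
    k, v:           (batch, n_tokens, dim)  prompt-side keys / values
    text_token_ids: indices of prompt tokens describing the scene text (assumed known)
    region_mask:    (batch, n_pixels) bool, True inside the desired text region
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bqd,bkd->bqk", q, k) * scale      # (batch, n_pixels, n_tokens)

    # Outside the target region, push the text-token scores to -inf so that,
    # after softmax, those tokens contribute (almost) nothing there.
    neg_inf = torch.finfo(scores.dtype).min
    outside = ~region_mask                                    # (batch, n_pixels)
    scores[:, :, text_token_ids] = scores[:, :, text_token_ids].masked_fill(
        outside.unsqueeze(-1), neg_inf
    )

    attn = scores.softmax(dim=-1)                             # (batch, n_pixels, n_tokens)
    return torch.einsum("bqk,bkd->bqd", attn, v)


# Toy usage with random tensors: a 64x64 latent grid and 8 prompt tokens,
# with the top quarter of the image pretending to be the text box.
if __name__ == "__main__":
    b, hw, n_tok, d = 1, 64 * 64, 8, 40
    q = torch.randn(b, hw, d)
    k = torch.randn(b, n_tok, d)
    v = torch.randn(b, n_tok, d)
    region = torch.zeros(b, hw, dtype=torch.bool)
    region[:, : hw // 4] = True
    out = localized_cross_attention(q, k, v, text_token_ids=[2, 3], region_mask=region)
    print(out.shape)  # torch.Size([1, 4096, 40])
```

In this sketch the region mask plays the role that the rendered sketch prior plays in the paper's pipeline; how Diff-Text actually derives the mask and combines it with the contrastive image-level prompts is described in the full paper linked below.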

Abstract (translated)

Recently, diffusion-based image generation methods have been praised for their remarkable text-to-image performance, yet they still face challenges in accurately generating multilingual scene text images. To address this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Given text in any language together with a textual description of a scene, our model outputs a photo-realistic image. The model uses rendered sketch images as priors, thereby awakening the latent multilingual generation ability of the pre-trained Stable Diffusion. Based on the observation that the cross-attention map influences object placement in generated images, we introduce a localized attention constraint in the cross-attention layer to address the unreasonable positioning of scene text. In addition, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments show that our method surpasses existing methods in both text recognition accuracy and the naturalness of foreground-background blending.

URL

https://arxiv.org/abs/2312.12232

PDF

https://arxiv.org/pdf/2312.12232.pdf

