
TextDiffuser: Diffusion Models as Text Painters

2023-05-18 10:16:19
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei

Abstract

Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, which focuses on generating images with visually appealing text that is coherent with the background. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from the text prompt, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text image dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images containing text. The code, model, and dataset will be available at this https URL.
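
The two-stage pipeline described above (a layout Transformer followed by a layout-conditioned diffusion model) can be illustrated with a minimal sketch. The class and function names below are hypothetical placeholders with stubbed internals, not the actual TextDiffuser implementation; they only show how a prompt's keywords would flow from layout generation to layout-conditioned image generation.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All names here are hypothetical placeholders, not the TextDiffuser API;
# both stages are stubbed out purely for illustration.

import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KeywordBox:
    """A keyword to render and its predicted bounding box (x0, y0, x1, y1)."""
    text: str
    box: Tuple[int, int, int, int]


def extract_keywords(prompt: str) -> List[str]:
    """Hypothetical helper: pull the quoted words that should appear as rendered text."""
    return re.findall(r"'([^']+)'", prompt)


def layout_transformer(keywords: List[str]) -> List[KeywordBox]:
    """Stage 1 (stub): a Transformer would predict one bounding box per keyword."""
    return [
        KeywordBox(text=word, box=(64, 64 + 80 * i, 448, 128 + 80 * i))
        for i, word in enumerate(keywords)
    ]


def diffusion_generate(prompt: str, layout: List[KeywordBox]) -> dict:
    """Stage 2 (stub): a diffusion model would denoise an image conditioned
    on the text prompt and the generated keyword layout."""
    return {"prompt": prompt, "layout": [(b.text, b.box) for b in layout]}


if __name__ == "__main__":
    prompt = "a poster with the words 'Hello' and 'World'"
    keywords = extract_keywords(prompt)         # ['Hello', 'World']
    layout = layout_transformer(keywords)       # stage 1: keyword bounding boxes
    image = diffusion_generate(prompt, layout)  # stage 2: layout-conditioned generation
    print(image)
```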

Abstract (translated)

Diffusion models have attracted growing attention for their impressive generation abilities, but they currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, which focuses on generating images whose visually appealing text is coherent with the background. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from the text prompt, and then diffusion models generate images conditioned on the text prompt and the generated layout. In addition, we contribute MARIO-10M, a large-scale text image dataset with OCR annotations containing 10 million image-text pairs annotated with text recognition, detection, and character-level segmentation. We also collect the MARIO-Eval benchmark as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable, generating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to restore incomplete images containing text. The code, model, and dataset will be available at this https URL.

URL

https://arxiv.org/abs/2305.10855

PDF

https://arxiv.org/pdf/2305.10855.pdf

