Abstract
Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce \textbf{TextDiffuser}, which focuses on generating images containing visually appealing text that is coherent with the background. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from the text prompt, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text image dataset with OCR annotations, \textbf{MARIO-10M}, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the \textbf{MARIO-Eval} benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images containing text. The code, model, and dataset will be available at \url{this https URL}.
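The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's implementation: all function names (`extract_keywords`, `generate_layout`, `generate_image`) and the box-placement logic are assumptions standing in for the Transformer layout model and the layout-conditioned diffusion model.

```python
import re


def extract_keywords(prompt: str) -> list[str]:
    # Toy keyword extraction: assume the text to render appears in
    # single quotes inside the prompt (a simplification; the paper
    # extracts keywords from the prompt itself).
    return re.findall(r"'([^']+)'", prompt)


def generate_layout(keywords: list[str]) -> list[dict]:
    # Stage 1 stand-in: the paper uses a Transformer to predict a
    # bounding box per keyword; here we simply stack boxes vertically
    # to illustrate the interface (text -> box coordinates).
    boxes = []
    for i, kw in enumerate(keywords):
        x0, y0 = 32, 32 + 48 * i
        boxes.append({"text": kw, "box": (x0, y0, x0 + 16 * len(kw), y0 + 32)})
    return boxes


def generate_image(prompt: str, layout: list[dict]) -> dict:
    # Stage 2 stand-in: a diffusion model would denoise an image
    # conditioned on the prompt and character-level layout masks;
    # here we just return the conditioning inputs.
    return {"prompt": prompt, "regions": layout}


prompt = "a poster that says 'HELLO' and 'WORLD'"
layout = generate_layout(extract_keywords(prompt))
image = generate_image(prompt, layout)
```

The key design point the sketch reflects is the decoupling: layout prediction is a discrete sequence task handled well by a Transformer, while the diffusion model only needs to fill in pixels consistent with the prompt and the provided character regions.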
URL
https://arxiv.org/abs/2305.10855