Abstract
Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, and the core problem lies in deficiencies of the text encoder. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution is a series of customized text encoders, Glyph-ByT5, obtained by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, which yields the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than $20\%$ to nearly $90\%$ on our design image benchmark. Notably, Glyph-SDXL gains the ability to render text paragraphs, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, by fine-tuning Glyph-SDXL on a small set of high-quality, photorealistic images featuring visual text, we show a substantial improvement in scene text rendering in open-domain real images. We hope these results encourage further exploration in designing customized text encoders for diverse and challenging tasks.
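To illustrate what character awareness means for a ByT5-style encoder, the following minimal sketch tokenizes text at the UTF-8 byte level, so every character of the text to be rendered is visible to the encoder as its own token(s). It is an illustration only, not the Glyph-ByT5 implementation: the function names are hypothetical, and it assumes the ByT5 convention that IDs 0 to 2 are reserved for padding, end-of-sequence, and unknown tokens, so each byte maps to its value plus 3.

```python
from __future__ import annotations

# Assumed ByT5-style convention: IDs 0, 1, 2 are reserved (pad, EOS, unknown),
# so a raw byte value b is mapped to token ID b + 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def byte_tokenize(text: str) -> list[int]:
    """Map text to one token ID per UTF-8 byte, followed by an EOS token."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize, skipping the reserved special-token IDs."""
    data = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return data.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ids = byte_tokenize("Glyph")
    print(ids)
    print(byte_detokenize(ids))
```

Because each letter keeps its own token, a misspelling such as "Glpyh" yields a different token sequence of the same length, whereas a subword tokenizer may merge several characters into a single token and hide the spelling from the encoder.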
URL
https://arxiv.org/abs/2403.09622