Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation

Abstract

Over the past few years, Text-to-Image (T2I) generation approaches based on diffusion models have gained significant attention. However, vanilla diffusion models often suffer from spelling inaccuracies in the text displayed within the generated images. The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications. To produce accurate visual text images, state-of-the-art techniques adopt a glyph-controlled image generation approach, consisting of a text layout generator followed by an image generator that is conditioned on the generated text layout. Nevertheless, our study reveals that these models still face three primary challenges, prompting us to develop a testbed to facilitate future research. We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text. Subsequently, we introduce a training-free framework to enhance the two-stage generation approaches. We examine the effectiveness of our approach on both LenCom-Eval and MARIO-Eval benchmarks and demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores. For instance, our proposed framework improves the backbone model, TextDiffuser, by more than 23% and 13.5% in terms of OCR word F1 on LenCom-Eval and MARIO-Eval, respectively. Our work makes a unique contribution to the field by focusing on generating images with long and rare text sequences, a niche previously unexplored by existing literature.
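The OCR-based metrics named in the abstract can be illustrated with a minimal sketch. It assumes that the words recognized by an OCR system in a generated image are compared against the words requested in the prompt as bags of words, and that edit distance is the standard Levenshtein distance. The function names `word_prf` and `edit_distance` are hypothetical, and the exact metric definitions used by LenCom-Eval and MARIO-Eval may differ.

```python
from collections import Counter


def word_prf(predicted, target):
    """Word-level precision, recall and F1 between OCR output and target words.

    Assumption: words are matched as multisets (order is ignored).
    """
    pred_counts = Counter(predicted)
    tgt_counts = Counter(target)
    # Multiset intersection counts each correctly rendered word at most
    # as often as it appears in both lists.
    overlap = sum((pred_counts & tgt_counts).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(target) if target else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical example: the image renders "wrold" instead of "world".
    print(word_prf(["hello", "wrold"], ["hello", "world"]))
    print(edit_distance("wrold", "world"))
```

Under these assumptions, a single misspelled word lowers both precision and recall, while the edit distance still reflects how close the rendered word is to the intended one.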

URL

https://arxiv.org/abs/2403.16422

PDF

https://arxiv.org/pdf/2403.16422.pdf
