
Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation

2024-03-25 04:54:49
Sanyam Lakhanpal, Shivang Chopra, Vinija Jain, Aman Chadha, Man Luo

Abstract

Over the past few years, Text-to-Image (T2I) generation approaches based on diffusion models have gained significant attention. However, vanilla diffusion models often suffer from spelling inaccuracies in the text displayed within the generated images. The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications. To produce accurate visual text images, state-of-the-art techniques adopt a glyph-controlled image generation approach, consisting of a text layout generator followed by an image generator that is conditioned on the generated text layout. Nevertheless, our study reveals that these models still face three primary challenges, prompting us to develop a testbed to facilitate future research. We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text. Subsequently, we introduce a training-free framework to enhance the two-stage generation approaches. We examine the effectiveness of our approach on both the LenCom-Eval and MARIO-Eval benchmarks and demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores. For instance, our proposed framework improves the backbone model, TextDiffuser, by more than 23% and 13.5% in terms of OCR word F1 on LenCom-Eval and MARIO-Eval, respectively. Our work makes a unique contribution to the field by focusing on generating images with long and rare text sequences, a niche previously unexplored by existing literature.
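The abstract names OCR word precision, recall, F1, and edit distance as evaluation metrics but does not define them. The following is a minimal sketch of how such scores could be computed from a target string and an OCR transcript of the generated image. The function names `word_prf` and `edit_distance`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the paper's exact definitions may differ.

```python
from collections import Counter


def word_prf(target: str, ocr_text: str):
    """Bag-of-words precision, recall, and F1 (assumed definition, not the paper's)."""
    ref = Counter(target.lower().split())      # words the prompt asked for
    hyp = Counter(ocr_text.lower().split())    # words the OCR engine read back
    overlap = sum((ref & hyp).values())        # multiset intersection size
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical example: one of three words is misspelled in the image.
    print(word_prf("grand opening sale", "grand openning sale"))
    print(edit_distance("opening", "openning"))
```

Under these assumptions, a single misspelled word lowers both word-level precision and recall, while the character-level edit distance records how far the rendered word is from the target.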


URL

https://arxiv.org/abs/2403.16422

PDF

https://arxiv.org/pdf/2403.16422.pdf

