Paper Reading AI Learner

Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

2024-03-14 17:55:33
Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan

Abstract

Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder using a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than $20\%$ to nearly $90\%$ on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, by fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we showcase a substantial improvement in scene text rendering capabilities in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks.
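The "character awareness" the abstract attributes to ByT5 comes from its byte-level tokenization: unlike subword tokenizers, which can collapse a word into a single opaque id, ByT5 feeds the encoder one token per UTF-8 byte, so every character of a word to be rendered is visible to the model. A minimal sketch of that mapping, assuming ByT5's public convention that ids 0/1/2 are reserved for pad/eos/unk and each byte `b` maps to id `b + 3` (the helper names here are illustrative, not from the paper):

```python
# Sketch of ByT5-style byte-level tokenization ("character awareness").
# Assumption: ids 0=pad, 1=eos, 2=unk; each UTF-8 byte b maps to id b + 3.
SPECIAL_OFFSET = 3

def byt5_encode(text: str) -> list[int]:
    """Map text to byte-level token ids, one id per UTF-8 byte."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")]

def byt5_decode(ids: list[int]) -> str:
    """Invert byt5_encode, skipping reserved special ids."""
    return bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET).decode("utf-8")

ids = byt5_encode("Glyph")
print(ids)                 # → [74, 111, 124, 115, 107], one id per character
print(byt5_decode(ids))    # → "Glyph"
```

Because the encoder sees each letter individually, spelling errors in rendered text can be attributed and corrected per character, which is what makes such an encoder a natural candidate for glyph-aligned fine-tuning.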

URL

https://arxiv.org/abs/2403.09622

PDF

https://arxiv.org/pdf/2403.09622.pdf

