JoyType: A Robust Design for Multilingual Visual Text Creation

2024-09-26 04:23:17
Chao Li, Chen Jiang, Xiaolong Liu, Jun Zhao, Guoxin Wang

Abstract

Generating images with accurately rendered text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as the integration of hint condition diagrams via auxiliary networks (e.g., ControlNet), have made strides towards addressing this issue. However, diffusion models often fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small fonts. In this paper, we introduce a novel approach for multilingual visual text creation, named JoyType, designed to maintain the font style of text during the image generation process. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million data pairs. Each pair includes an image, its description, and glyph instructions corresponding to the font style within the image. We then develop a text control network, Font ControlNet, tasked with extracting font style information to steer the image generation. To further enhance our model's ability to maintain font style, notably when generating small-font text, we incorporate a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, facilitating the creation of varied image styles in conjunction with other Stable Diffusion models on HuggingFace and CivitAI. Our project is open-sourced at this https URL.
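The Font ControlNet described in the abstract follows the standard ControlNet recipe: a rendered glyph image, drawn in the target font at the target position, serves as the hint condition for generation. The sketch below illustrates that flow with the HuggingFace diffusers API. The font file, drawing coordinates, and the "path/to/font-controlnet" checkpoint id are placeholders of my own, not artifacts released by the paper; only the base SD 1.5 checkpoint id is a real published model.

```python
import torch
from PIL import Image, ImageDraw, ImageFont
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Render the glyph hint: target text in the desired font, white on black,
# at the position it should occupy in the generated image.
hint = Image.new("RGB", (512, 512), "black")
draw = ImageDraw.Draw(hint)
font = ImageFont.truetype("font.ttf", 64)  # placeholder: any .ttf with the needed glyphs
draw.text((96, 224), "JoyType", font=font, fill="white")

# "path/to/font-controlnet" is a stand-in for a glyph-conditioned
# ControlNet checkpoint; swap in the real weights once released.
controlnet = ControlNetModel.from_pretrained(
    "path/to/font-controlnet", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a neon sign on a rainy street, photorealistic",
    image=hint,  # the glyph hint steers where and how the text is drawn
    num_inference_steps=30,
).images[0]
image.save("joytype_sample.png")
```

The multi-layer OCR-aware loss can be pictured as feature matching against a frozen text-recognition backbone at several depths, so that shallow layers, which carry the stroke-level (low-level) detail small fonts depend on, also receive gradient. The following is a minimal sketch under that assumption; the paper's exact loss formulation and OCR model are not given in the abstract, and TinyOCRBackbone is a stand-in for a pretrained recognizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyOCRBackbone(nn.Module):
    """Stand-in for a frozen OCR feature extractor; in practice this
    would be a pretrained text-recognition backbone."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # keep features from every depth
        return feats

def ocr_aware_loss(pred_x0, glyph_target, backbone, weights=(1.0, 1.0, 1.0)):
    """Match OCR features of the denoised estimate against the rendered
    glyph target at several depths (one weight per layer)."""
    with torch.no_grad():
        target_feats = backbone(glyph_target)
    pred_feats = backbone(pred_x0)
    return sum(w * F.mse_loss(p, t)
               for w, p, t in zip(weights, pred_feats, target_feats))

# Usage: compare a denoised image estimate against its glyph render.
backbone = TinyOCRBackbone().eval()
pred_x0 = torch.rand(1, 3, 128, 128)   # denoised estimate from the diffusion step
glyph = torch.rand(1, 3, 128, 128)     # rendered glyph target
loss = ocr_aware_loss(pred_x0, glyph, backbone)
```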


URL

https://arxiv.org/abs/2409.17524

PDF

https://arxiv.org/pdf/2409.17524.pdf

