Abstract
Generating images with accurately rendered text, especially in non-Latin languages, poses a significant challenge for diffusion models. Existing approaches, such as injecting hint condition images through auxiliary networks (e.g., ControlNet), have made strides toward addressing this issue. However, diffusion models still fall short in tasks requiring controlled text generation, such as specifying particular fonts or producing text in small font sizes. In this paper, we introduce JoyType, a novel approach for multilingual visual text generation designed to preserve the font style of text during image generation. Our methodology begins with assembling a training dataset, JoyType-1M, comprising 1 million samples, each consisting of an image, its description, and glyph instructions corresponding to the font style within the image. We then develop a text control network, Font ControlNet, which extracts font style information to steer image generation. To further strengthen the model's ability to preserve font style, notably when generating small-font text, we incorporate a multi-layer OCR-aware loss into the diffusion process. This enhancement allows JoyType to direct text rendering using low-level descriptors. Our evaluations, based on both visual and accuracy metrics, demonstrate that JoyType significantly outperforms existing state-of-the-art methods. Additionally, JoyType can function as a plugin, enabling the creation of varied image styles in combination with other Stable Diffusion models on HuggingFace and CivitAI. Our project is open-sourced at this https URL.
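To make the multi-layer OCR-aware loss concrete, the following is a minimal sketch of how such a term could be attached to a standard epsilon-prediction diffusion training objective; it is not the paper's released implementation. The `ocr_backbone` (a frozen text-recognition encoder returning a list of intermediate feature maps), the `decode_fn` mapping latents to image space, the layer weights, and `lambda_ocr` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def predict_x0(noisy_latents, noise_pred, alphas_cumprod, timesteps):
    # Recover the model's current estimate of the clean latents from its
    # epsilon prediction: x0 = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar).
    a_bar = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    return (noisy_latents - torch.sqrt(1.0 - a_bar) * noise_pred) / torch.sqrt(a_bar)


def multi_layer_ocr_loss(ocr_backbone, pred_image, glyph_target,
                         layer_weights=(1.0, 0.5, 0.25)):
    # Compare low-level OCR features of the generated image with those of the
    # rendered glyph target, layer by layer (shallow layers retain stroke detail).
    with torch.no_grad():
        target_feats = ocr_backbone(glyph_target)
    pred_feats = ocr_backbone(pred_image)
    return sum(w * F.l1_loss(p, t)
               for w, p, t in zip(layer_weights, pred_feats, target_feats))


def training_loss(noise_pred, noise, noisy_latents, timesteps, alphas_cumprod,
                  decode_fn, glyph_target, ocr_backbone, lambda_ocr=0.1):
    # Standard diffusion (noise-prediction) loss ...
    diffusion_loss = F.mse_loss(noise_pred, noise)
    # ... plus the OCR-aware term computed on the decoded x0 estimate.
    x0 = predict_x0(noisy_latents, noise_pred, alphas_cumprod, timesteps)
    ocr_loss = multi_layer_ocr_loss(ocr_backbone, decode_fn(x0), glyph_target)
    return diffusion_loss + lambda_ocr * ocr_loss
```

Likewise, the plugin-style usage described in the abstract could look roughly like the sketch below, assuming the Font ControlNet weights are compatible with the standard diffusers ControlNet format; the checkpoint paths are placeholders, not confirmed repository names.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

font_controlnet = ControlNetModel.from_pretrained(
    "path/to/joytype-font-controlnet",    # placeholder checkpoint path
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/sd15-style-base-model",      # e.g. a community model from HuggingFace or CivitAI
    controlnet=font_controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The control image is a rendered glyph layout: the target text drawn in the
# desired font, size, and position on a plain canvas.
glyph_hint = load_image("glyph_layout.png")
result = pipe(
    prompt="a neon shop sign on a rainy street at night, photorealistic",
    image=glyph_hint,
    num_inference_steps=30,
).images[0]
result.save("joytype_sample.png")
```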
URL
https://arxiv.org/abs/2409.17524