Abstract
Diffusion-based image generation methods have recently been credited with remarkable text-to-image generation capabilities, yet they still struggle to accurately generate multilingual scene text images. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Given text in any language along with a textual description of a scene, our model outputs a photo-realistic image. The model leverages rendered sketch images as priors, thereby unlocking the latent multilingual generation ability of the pre-trained Stable Diffusion model. Based on the observation that the cross-attention map influences object placement in generated images, we introduce a localized attention constraint in the cross-attention layers to address the problem of unreasonable scene-text positioning. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both text recognition accuracy and the naturalness of foreground-background blending.
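To make the "rendered sketch images as priors" idea concrete, here is a minimal sketch of how glyphs for the target text might be rasterized onto a canvas before being fed to a sketch-conditioned diffusion pipeline. This is an illustrative reconstruction, not the paper's implementation: the font path, glyph size, and placement box are all assumptions.

```python
# Hypothetical sketch prior: render the target glyphs as black-on-white,
# producing an image a sketch/edge-conditioned Stable Diffusion model can
# use as a layout prior. Font path and placement box are assumptions.
from PIL import Image, ImageDraw, ImageFont

def render_text_sketch(text: str, box: tuple[int, int], size: int = 512,
                       font_path: str = "NotoSansCJK-Regular.ttc") -> Image.Image:
    """Render `text` as black glyphs on a white canvas, anchored at `box`."""
    canvas = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype(font_path, 64)  # any font covering the script
    draw.text(box, text, fill="black", font=font)
    return canvas

# Usage: a sketch prior for Korean text placed in the upper-left region.
sketch = render_text_sketch("안녕하세요", box=(64, 64))
sketch.save("text_sketch.png")  # pass to a sketch-conditioned pipeline
```

Because the glyphs are rendered rather than generated, this step works for any script the chosen font covers, which is what makes the framework language-agnostic.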
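The localized attention constraint could plausibly be realized by biasing cross-attention scores so that text-related tokens attend inside the intended text region. The following is a minimal sketch under that assumption; the mask, the choice of token indices, and the bias scale are all illustrative, and the authors' exact formulation may differ.

```python
# A minimal sketch (not the authors' exact formulation) of a localized
# cross-attention constraint: pre-softmax attention scores for the chosen
# text tokens are boosted inside a binary text-region mask and suppressed
# outside it. `mask`, `text_token_ids`, and `scale` are assumptions.
import torch

def localized_attention(attn: torch.Tensor, mask: torch.Tensor,
                        text_token_ids: list[int],
                        scale: float = 5.0) -> torch.Tensor:
    """attn: (batch, num_pixels, num_tokens) pre-softmax attention scores.
    mask: (num_pixels,) binary mask, 1 inside the intended text region."""
    biased = attn.clone()
    # +scale inside the region, -scale outside, for the text tokens only
    biased[:, :, text_token_ids] += scale * (2.0 * mask - 1.0).unsqueeze(-1)
    return biased.softmax(dim=-1)

# Usage with dummy shapes: a 64x64 latent grid and 77 CLIP text tokens.
attn = torch.randn(1, 64 * 64, 77)
mask = torch.zeros(64 * 64)
mask[:16 * 64] = 1.0  # assumed text region: top quarter of the grid
probs = localized_attention(attn, mask, text_token_ids=[5, 6, 7])
```

Constraining where the text tokens can attend directly addresses the positioning problem described above, since the attention map determines where the corresponding content is placed in the generated image.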
URL
https://arxiv.org/abs/2312.12232