Abstract
Generating visual text in natural scene images is a challenging task with many unsolved problems. Unlike generating text on artificially designed images (such as posters, covers, and cartoons), text in natural scene images must meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any stroke. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, and walls), and the generated text content should be relevant to the scene. (3) Utility: the generated text should facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: the attributes of the text (such as font and color) should be controllable. In this paper, we propose a two-stage method, SceneVTG++, which simultaneously satisfies all four of these criteria. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former uses the world knowledge of multimodal large language models to find reasonable text areas and recommend text content that fits the natural scene background image, while the latter generates controllable multilingual text with a diffusion model. Through extensive experiments, we verify the effectiveness of TLCG and CLTD individually and demonstrate the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images show superior utility in OCR tasks such as text detection and text recognition. Code and datasets will be made available.
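The two-stage pipeline described above can be outlined as follows. This is a minimal illustrative sketch, not the authors' implementation: `tlcg`, `cltd`, `scene_vtg_pp`, `TextRegion`, and all example values are hypothetical stand-ins for the paper's MLLM-based layout/content stage and diffusion-based rendering stage.

```python
# Hypothetical sketch of the SceneVTG++ two-stage pipeline.
# Stage 1 (TLCG): propose reasonable text regions and scene-relevant content.
# Stage 2 (CLTD): render controllable text into each proposed region.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextRegion:
    bbox: Tuple[int, int, int, int]  # (x, y, w, h) carrier area in the image
    content: str                     # scene-relevant text recommended by stage 1


def tlcg(background: str) -> List[TextRegion]:
    """Stage 1 stub: an MLLM would analyze the background image and
    return carrier areas (boards, signs, walls) plus suggested text."""
    # Fixed placeholder output; a real model would be conditioned on the image.
    return [TextRegion((40, 60, 200, 50), "FRESH COFFEE")]


def cltd(background: str, region: TextRegion,
         font: str = "sans", color: str = "black") -> str:
    """Stage 2 stub: a diffusion model would locally render the text
    with controllable attributes (font, color) inside the region."""
    x, y, w, h = region.bbox
    return (f"render '{region.content}' in {font}/{color} "
            f"at ({x},{y},{w},{h}) on {background}")


def scene_vtg_pp(background: str) -> List[str]:
    """Run stage 1 then stage 2 over every proposed region."""
    return [cltd(background, region) for region in tlcg(background)]


print(scene_vtg_pp("street_scene.jpg"))
```

The key design point the sketch mirrors is the decoupling: layout and content decisions (where text can reasonably appear, what it should say) are made once per image, while rendering with controllable attributes happens locally per region.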
URL
https://arxiv.org/abs/2501.02962