On Manipulating Scene Text in the Wild with Diffusion Models

2023-11-01 11:31:50
Joshua Santoso, Christian Simon, Williem Pao

Abstract

Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, images generated by stable diffusion models suffer from deteriorated details, a pitfall that impacts image editing tasks requiring information preservation, e.g., scene text editing. As a desired result, the model must be able to replace the text in the source image with the target text while preserving details such as color, font size, and background. To leverage the potential of diffusion models, in this work we introduce a Diffusion-BasEd Scene Text manipulation network, called DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-art methods on various scene text datasets, then provide extensive ablation studies at each granularity to analyze our performance gains. We also demonstrate the effectiveness of our proposed method for synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy: our method achieves 94.15% and 98.12% character-level accuracy on the COCO-Text and ICDAR2013 datasets, respectively.
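The abstract names "text-recognition guidance" as one of the two adaptation strategies. A common way to realize this kind of guidance is classifier-style steering: at each denoising step, the gradient of an OCR model's recognition loss on the intermediate sample nudges generation toward the target text. The sketch below illustrates that idea only, under stated assumptions; `unet`, `ocr_model`, `target_ids`, and the diffusers-style `scheduler` are hypothetical stand-ins, not the authors' released code.

```python
import torch

@torch.no_grad()
def guided_step(unet, scheduler, ocr_model, x_t, t, target_ids, scale=1.0):
    """One denoising step with text-recognition guidance (illustrative)."""
    # Ordinary reverse-diffusion step: predict noise, then let the scheduler
    # (assumed to expose a diffusers-style step()) produce x_{t-1}.
    eps = unet(x_t, t)
    x_prev = scheduler.step(eps, t, x_t).prev_sample

    # Recognition guidance: gradient of the OCR loss w.r.t. the current
    # sample, pushing it toward the target transcription `target_ids`.
    with torch.enable_grad():
        x = x_prev.detach().requires_grad_(True)
        logits = ocr_model(x)  # hypothetical recognizer, shape (B, T, vocab)
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), target_ids.flatten())
        grad = torch.autograd.grad(loss, x)[0]
    return x_prev - scale * grad
```

The reported character-level OCR accuracy is typically computed as one minus the normalized edit distance between the recognized string and the ground-truth target. The paper's exact evaluation protocol may differ; this is a minimal reference implementation of that common definition:

```python
def char_accuracy(pred: str, target: str) -> float:
    """1 - normalized Levenshtein distance, clamped to [0, 1]."""
    m, n = len(pred), len(target)
    dp = list(range(n + 1))  # distance of "" vs target[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # delete pred[i-1]
                        dp[j - 1] + 1,  # insert target[j-1]
                        prev + (pred[i - 1] != target[j - 1]))
            prev = cur
    return max(0.0, 1.0 - dp[n] / max(n, 1))
```

For example, `char_accuracy("STOP", "SHOP")` returns 0.75: one substitution over a four-character target.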

URL

https://arxiv.org/abs/2311.00734

PDF

https://arxiv.org/pdf/2311.00734.pdf

