Abstract
Text-to-image (T2I) research has grown explosively in the past year, owing to large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: text prompt engineering. Searching for high-quality text prompts that yield customized results is more art than science. Moreover, as the saying goes, "an image is worth a thousand words": attempts to describe a desired image in text often end up ambiguous, failing to cover delicate visual details and hence necessitating additional controls from the visual domain. In this paper, we take a bold step forward: we take "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering effort for users. Our proposed framework, Prompt-Free Diffusion, relies on visual inputs alone to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scenes is the Semantic Context Encoder (SeeCoder), which substitutes for the commonly used CLIP-based or LLM-based text encoder. SeeCoder's reusability also makes it a convenient drop-in component: one can pre-train a SeeCoder with one T2I model and reuse it in another. Through extensive experiments, Prompt-Free Diffusion is found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models driven by best-practice prompts; and (iii) extend naturally to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models are open-sourced at this https URL.
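The drop-in substitution described above can be sketched in miniature: the conditioning interface of a T2I diffusion model stays the same, but the context embedding comes from a reference image instead of a text prompt. This is only an illustrative sketch; every class and function name below is hypothetical and stands in for the paper's actual components.

```python
# Hedged sketch of the Prompt-Free Diffusion interface. All names here
# (CLIPTextEncoder, SeeCoder, DiffusionUNet, prompt_free_generate) are
# hypothetical illustrations, not the authors' real API.

class CLIPTextEncoder:
    """Stand-in for the usual text encoder in a T2I diffusion model."""
    def encode(self, prompt: str) -> list:
        # Real models return a sequence of token embeddings; here we
        # return a dummy fixed-size "context" for illustration only.
        return [float(len(prompt))] * 4

class SeeCoder:
    """Stand-in for the Semantic Context Encoder: it maps a reference
    *image* (not text) to the same kind of context embedding, so it can
    drop in where the text encoder used to sit."""
    def encode(self, reference_image) -> list:
        # A real SeeCoder extracts semantic visual features; here we
        # just average pixel values as a placeholder.
        flat = [p for row in reference_image for p in row]
        return [sum(flat) / len(flat)] * 4

class DiffusionUNet:
    """Stand-in denoiser that consumes the context (via cross-attention
    in a real model) plus optional structural conditioning."""
    def generate(self, noise, context, structure=None):
        # Placeholder arithmetic; a real UNet iteratively denoises
        # `noise` conditioned on `context` (and structure maps).
        out = [n + c for n, c in zip(noise, context)]
        if structure is not None:
            out = [o + s for o, s in zip(out, structure)]
        return out

def prompt_free_generate(unet, seecoder, reference_image, noise, structure=None):
    """No text prompt anywhere: the reference image supplies the context."""
    context = seecoder.encode(reference_image)
    return unet.generate(noise, context, structure)
```

Because both encoders emit the same kind of context embedding, the denoiser is agnostic to where the conditioning came from, which is what makes SeeCoder reusable across different pre-trained T2I backbones.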
URL
https://arxiv.org/abs/2305.16223