Abstract
Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, leading to a lack of disentanglement and editability. To address this, we propose a novel approach that exploits the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer stronger disentanglement and controllability than existing methods. We apply ProSpect to various personalized, attribute-aware image generation applications, such as image- or text-guided transfer and editing of material, style, and layout, achieving previously unattainable results from a single image input without fine-tuning the diffusion models.
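To make the per-stage conditioning concrete, below is a minimal sketch (not the authors' code) of the idea the abstract describes: the T denoising timesteps are partitioned into N consecutive stages, and each stage is conditioned on its own inverted token embedding rather than a single fixed prompt. The names `stage_embeddings`, `embedding_for_step`, and the commented-out `denoise_step` call are hypothetical placeholders, and the embeddings here are random stand-ins for the ones ProSpect would invert from a reference image.

```python
import torch

T = 1000   # total diffusion timesteps (typical for Stable Diffusion; an assumption here)
N = 10     # number of generation stages, each a group of consecutive steps
D = 768    # dimensionality of one text-token embedding (CLIP-like; an assumption)

# One learned token embedding per stage. In ProSpect these would be
# inverted from the reference image; random placeholders for this sketch.
stage_embeddings = [torch.randn(D) for _ in range(N)]

def embedding_for_step(t: int) -> torch.Tensor:
    """Map timestep t (T-1 .. 0, high noise -> low noise) to its stage's embedding.

    Early, high-noise steps shape low-frequency content such as layout;
    late, low-noise steps refine high-frequency detail such as material.
    """
    stage = min(N - 1, (T - 1 - t) * N // T)  # stage 0 = earliest (noisiest) steps
    return stage_embeddings[stage]

# Inside a (mock) denoising loop, the text conditioning is swapped per stage
# instead of staying fixed for all steps; editing one stage's embedding then
# affects only the attributes formed at that stage.
x = torch.randn(4, 64, 64)                 # latent being denoised (placeholder shape)
for t in reversed(range(T)):
    cond = embedding_for_step(t)
    # x = denoise_step(x, t, cond)         # hypothetical denoiser call
```

Under this reading, attribute-aware editing amounts to replacing the embeddings of only the stages responsible for a given attribute (e.g., the early stages for layout) while leaving the rest untouched.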
URL
https://arxiv.org/abs/2305.16225