Abstract
Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models demonstrating an astounding ability to synthesize convincing images from an input text prompt. The goal of image editing research is to give users control over generated images by modifying the text prompt. Current image editing techniques are prone to unintended modifications of regions outside the targeted area, such as the background or distractor objects that have a semantic or visual relationship with the target object. Our experimental findings show that inaccurate cross-attention maps are at the root of this problem. Based on this observation, we propose Dynamic Prompt Learning (DPL), which forces cross-attention maps to focus on the correct noun words in the text prompt. By updating the dynamic tokens for nouns in the textual input with the proposed leakage repair losses, we achieve fine-grained editing of particular objects while preventing undesired changes to other image regions. Our method, DPL, built on the publicly available Stable Diffusion, is extensively evaluated on a wide range of images and consistently obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (user evaluation). We show improved prompt-editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
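The core idea the abstract describes, penalizing cross-attention that a noun token "leaks" outside its object region, can be sketched as follows. This is a minimal numpy illustration under assumed shapes; `cross_attention_maps`, `leakage_loss`, and all tensors here are hypothetical stand-ins, not the paper's actual implementation or losses.

```python
import numpy as np

def cross_attention_maps(Q, K):
    """Softmax cross-attention of image queries over text-token keys.

    Q: (n_pixels, d) image-feature queries; K: (n_tokens, d) text-token keys.
    Returns A of shape (n_pixels, n_tokens), rows summing to 1.
    """
    d = Q.shape[1]
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    return A / A.sum(axis=1, keepdims=True)

def leakage_loss(A, noun_idx, object_mask):
    """Fraction of a noun token's (normalized) attention that falls
    outside its object region -- a toy 'leakage' penalty.

    object_mask: boolean (n_pixels,), True inside the object's region.
    """
    a = A[:, noun_idx]
    a = a / (a.sum() + 1e-8)
    return float(a[~object_mask].sum())

# Toy demo: 4 pixels (2 object, 2 background), 2 text tokens.
Q = np.array([[1.0, 0.0], [1.0, 0.0],   # object pixels
              [0.0, 1.0], [0.0, 1.0]])  # background pixels
K = np.array([[5.0, 0.0],   # token 0: key aligned with the object
              [0.0, 5.0]])  # token 1: key aligned with the background
mask = np.array([True, True, False, False])
A = cross_attention_maps(Q, K)
# A well-aligned noun token leaks little attention outside its mask;
# a misaligned one leaks most of it.
```

In DPL this kind of leakage signal is minimized by optimizing learnable ("dynamic") token embeddings per noun, so each noun's cross-attention map concentrates on its own object before editing is applied.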
URL
https://arxiv.org/abs/2309.15664