Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing

2023-09-27 13:55:57
Kai Wang, Fei Yang, Shiqi Yang, Muhammad Atif Butt, Joost van de Weijer

Abstract

Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models showing an astounding ability to synthesize convincing images that follow an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques are susceptible to unintended modifications of regions outside the targeted area, such as the background or distractor objects that have some semantic or visual relationship with the targeted object. According to our experimental findings, inaccurate cross-attention maps are at the root of this problem. Based on this observation, we propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on the correct noun words in the text prompt. By updating the dynamic tokens for nouns in the textual input with the proposed leakage repairment losses, we achieve fine-grained image editing over particular objects while preventing undesired changes to other image regions. Our method DPL, based on the publicly available Stable Diffusion, is extensively evaluated on a wide range of images and consistently obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (user evaluation). We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
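
To make the notion of cross-attention leakage concrete, here is a minimal sketch of one way overlap between the attention maps of two noun tokens could be measured. The abstract does not define the leakage repairment losses, so this is not the paper's formulation: the mean pairwise cosine similarity used here, the function names `cosine_similarity` and `attention_overlap`, and the toy 4-pixel maps are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened attention maps (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def attention_overlap(maps):
    """Mean pairwise cosine similarity between the maps of different nouns.

    Hypothetical leakage measure (an assumption, not the paper's loss):
    0.0 means the nouns attend to disjoint pixels; higher values mean
    more shared attention, i.e. more leakage between objects.
    """
    names = sorted(maps)
    pairs = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]]
    if not pairs:
        return 0.0
    return sum(cosine_similarity(maps[p], maps[q]) for p, q in pairs) / len(pairs)


# Toy flattened 4-pixel attention maps for two noun tokens.
leaky = {"cat": [0.5, 0.5, 0.0, 0.0], "dog": [0.0, 0.5, 0.5, 0.0]}
clean = {"cat": [1.0, 0.0, 0.0, 0.0], "dog": [0.0, 0.0, 1.0, 0.0]}

print(attention_overlap(leaky) > attention_overlap(clean))
```

In a setup like the one the abstract describes, a differentiable quantity of this kind would be minimized with respect to the noun token embeddings, so that each noun's map concentrates on its own object.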

URL

https://arxiv.org/abs/2309.15664

PDF

https://arxiv.org/pdf/2309.15664.pdf

