Paper Reading AI Learner

Editing Implicit Assumptions in Text-to-Image Diffusion Models

2023-03-14 17:14:21
Hadas Orgad, Bahjat Kawar, Yonatan Belinkov

Abstract

Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME is successful in model editing, generalizes well for related prompts unseen during editing, and imposes minimal effect on unrelated generations.
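The core idea in the abstract (editing a cross-attention projection matrix so that the source prompt is mapped close to the destination prompt, while staying near the pretrained weights) can be sketched as a regularized least-squares edit with a closed-form solution. The toy dimensions, the regularization strength `lam`, and the exact loss form below are illustrative assumptions, not details taken verbatim from the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_attn = 8, 6  # toy dimensions; real models use much larger embeddings

W_old = rng.normal(size=(d_attn, d_text))  # a pretrained projection (e.g., a key/value matrix)
C_src = rng.normal(size=(d_text, 3))       # source-prompt token embeddings (one per column)
C_dst = rng.normal(size=(d_text, 3))       # destination-prompt token embeddings
V_dst = W_old @ C_dst                      # targets: what the unedited model assigns the destination

lam = 0.1  # regularization strength toward W_old (hypothetical value)

# Closed-form minimizer of  lam * ||W - W_old||_F^2  +  ||W @ C_src - V_dst||_F^2,
# i.e., map source tokens to destination values while staying close to the old weights.
W_new = (lam * W_old + V_dst @ C_src.T) @ np.linalg.inv(
    lam * np.eye(d_text) + C_src @ C_src.T
)

err_before = np.linalg.norm(W_old @ C_src - V_dst)
err_after = np.linalg.norm(W_new @ C_src - V_dst)
```

After the edit, the source embeddings are projected much closer to the destination values (`err_after` drops well below `err_before`), while the regularizer keeps `W_new` near `W_old` so that unrelated prompts are minimally affected, which matches the behavior the abstract reports.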


URL

https://arxiv.org/abs/2303.08084

PDF

https://arxiv.org/pdf/2303.08084.pdf

