DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

2024-03-21 15:35:42
Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, Shanghang Zhang

Abstract

Achieving precise image editing has recently attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities in one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, which include several object layers and one incomplete background layer that requires reliable inpainting. To avoid extra tuning, we further explore the inpainting ability inherent in the self-attention mechanism. We introduce a key-masking self-attention scheme that propagates the surrounding context information into the masked region while mitigating its impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Owing to the inherent modular advantages of such multi-layered representations, we can achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Finally, we show that our approach is a unified framework that supports more than six different accurate image editing tasks.
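
The two sub-tasks described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it is one plausible reading of "key-masking self-attention" (keys inside the masked region receive zero attention weight, so every output is built only from surrounding context) and of latent fusion (masked cells of an object layer are copied onto a canvas at an offset). The function names, the plain-list data layout, and the single-head formulation are all assumptions made for illustration; the paper applies these ideas to latents inside a diffusion model.

```python
import math


def key_masked_self_attention(queries, keys, values, key_is_masked):
    """Single-head attention in which masked keys get zero weight.

    queries, keys, values: lists of float vectors, one per token.
    key_is_masked: list of bools; True marks a token inside the
    region to inpaint (hypothetical convention).
    """
    if all(key_is_masked):
        raise ValueError("at least one key must lie outside the mask")
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Scaled dot-product scores; masked keys are set to -inf.
        scores = [
            -math.inf if masked else scale * sum(a * b for a, b in zip(q, k))
            for k, masked in zip(keys, key_is_masked)
        ]
        top = max(scores)  # finite, since some key is unmasked
        weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of values; masked tokens contribute nothing.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def paste_layer(canvas, layer, layer_mask, row_offset, col_offset):
    """Copy the masked cells of `layer` onto a copy of `canvas`,
    shifted by the given offset (an assumed stand-in for a move edit)."""
    fused = [row[:] for row in canvas]
    for r, mask_row in enumerate(layer_mask):
        for c, inside in enumerate(mask_row):
            rr, cc = r + row_offset, c + col_offset
            if inside and 0 <= rr < len(fused) and 0 <= cc < len(fused[0]):
                fused[rr][cc] = layer[r][c]
    return fused
```

In this sketch, a query located inside the masked region still produces an output, but that output is a mixture of values from unmasked tokens only, which is the sense in which context is propagated into the hole. Cells pasted outside the canvas bounds are silently dropped.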

URL

https://arxiv.org/abs/2403.14487

PDF

https://arxiv.org/pdf/2403.14487.pdf

