Abstract
Precise image editing has recently attracted increasing attention, driven by the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities in one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, comprising several object layers and one incomplete background layer that requires reliable inpainting. To avoid extra tuning, we further exploit the inpainting ability inherent in the self-attention mechanism: we introduce a key-masking self-attention scheme that propagates surrounding context information into the masked region while mitigating its influence on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent, together with an artifact-suppression scheme in the latent space that further enhances inpainting quality. Thanks to the inherent modularity of such multi-layered representations, our approach achieves accurate image editing and consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Finally, we show that our approach serves as a unified framework supporting more than six distinct accurate image editing tasks.
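The key-masking self-attention idea can be illustrated with a minimal NumPy sketch. This is our own simplified single-head formulation with identity projections, not the paper's implementation: masked tokens are blocked from acting as keys, so every query attends only to the surrounding context, which lets context flow into the masked region while the hole's contents cannot influence the outside.

```python
import numpy as np

def key_masking_self_attention(x, mask, scale=None):
    """Sketch of key-masking self-attention (our simplification: one head,
    identity query/key/value projections).

    x:    (N, D) array of latent tokens.
    mask: (N,) boolean array, True for tokens inside the inpainting region.

    Masked tokens may not serve as keys/values, so all queries (including
    those inside the mask) attend only to the unmasked context.
    """
    n, d = x.shape
    if scale is None:
        scale = 1.0 / np.sqrt(d)
    logits = (x @ x.T) * scale                     # (N, N) query-key scores
    logits[:, mask] = -np.inf                      # block masked tokens as keys
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                             # masked values get zero weight

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
mask = np.zeros(6, dtype=bool)
mask[3:5] = True
out = key_masking_self_attention(x, mask)
```

A useful sanity check on this formulation: perturbing a masked token changes only that token's own output (through its query), never the outputs of tokens outside the mask, which is exactly the "no leakage out of the hole" property described above.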
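To make the fusion step concrete, here is a rough sketch of pasting layered latents onto a canvas latent. The layer format (a latent patch, a binary mask, and an instruction-derived spatial offset) and all names are our assumptions for illustration, not the paper's interface:

```python
import numpy as np

def fuse_latent_layers(canvas, layers):
    """Sketch of multi-layered latent fusion (assumed layer format:
    (latent, mask, (dy, dx)) triples; channels-last latent grids).

    Each object layer's masked latent values are pasted onto the canvas
    latent at its offset; later layers overwrite earlier ones, mimicking
    layer ordering in design tools.
    """
    out = canvas.copy()
    h_canvas, w_canvas, _ = out.shape
    for latent, mask, (dy, dx) in layers:
        ys, xs = np.nonzero(mask)          # source positions inside the mask
        ty, tx = ys + dy, xs + dx          # target positions on the canvas
        keep = (ty >= 0) & (ty < h_canvas) & (tx >= 0) & (tx < w_canvas)
        out[ty[keep], tx[keep]] = latent[ys[keep], xs[keep]]
    return out

# Paste a 2x2 object latent onto a 4x4 canvas, shifted by (1, 1).
canvas = np.zeros((4, 4, 2))
obj = np.ones((2, 2, 2))
obj_mask = np.ones((2, 2), dtype=bool)
fused = fuse_latent_layers(canvas, [(obj, obj_mask, (1, 1))])
```

Moving an object then amounts to changing its offset, and removing it amounts to dropping its layer before fusion, which is the modularity the abstract attributes to the layered representation.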
URL
https://arxiv.org/abs/2403.14487