Abstract
Despite the success of diffusion-based generative models at producing high-quality images from arbitrary text prompts, prior works generate the entire image directly and cannot provide object-wise manipulation. To support wider real-world applications such as professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers, which offers greater flexibility and control. In this paper, we therefore propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. A composable image consists of a background layer, a set of foreground layers, and an associated mask layer for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module encourages information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to guide the generation of layer-specific content. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that LayerDiff generates high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
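The two layer-collaborative attention modules can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the single-head, residual, projection-free attention, all tensor shapes, and the variable names are assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

# Hypothetical shapes: L image layers, each with N spatial tokens of
# dimension D, plus T layer-specific prompt tokens per layer.
L, N, D, T = 3, 16, 32, 8
rng = np.random.default_rng(0)
layers = rng.standard_normal((L, N, D))    # per-layer image features
prompts = rng.standard_normal((L, T, D))   # layer-specific text embeddings

# Inter-layer attention: flatten tokens across all layers so every token
# can exchange information with every other layer, then split back.
flat = layers.reshape(L * N, D)
flat = flat + attention(flat, flat, flat)  # residual self-attention
layers = flat.reshape(L, N, D)

# Text-guided intra-layer attention: each layer cross-attends only to
# its own layer-specific prompt tokens.
out = np.stack([layers[i] + attention(layers[i], prompts[i], prompts[i])
                for i in range(L)])
print(out.shape)  # (3, 16, 32)
```

The key structural idea illustrated here is the split of responsibilities: the inter-layer step operates on the concatenated token sequence so layers stay mutually consistent, while the intra-layer step conditions each layer only on its own prompt so its content stays layer-specific.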
URL
https://arxiv.org/abs/2403.11929