Abstract
Recent diffusion-based generators can produce high-quality images from textual prompts alone. However, they do not correctly interpret instructions that specify the spatial layout of the composition. We propose a simple approach that achieves robust layout control without training or fine-tuning the image generator. Our technique, which we call layout guidance, manipulates the cross-attention layers that the model uses to interface textual and visual information, and steers the reconstruction in the desired direction given, e.g., a user-specified layout. To determine how best to guide attention, we study the role of different attention maps when generating images and experiment with two alternative strategies, forward and backward guidance. We evaluate our method quantitatively and qualitatively in several experiments, validating its effectiveness. We further demonstrate its versatility by extending layout guidance to the task of editing the layout and context of a given real image.
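As a rough illustration of what steering a cross-attention map toward a user-specified layout could look like, the toy sketch below works on a single token's attention map stored as a plain 2D list. The abstract does not give formulas, so everything here is an assumption made for illustration: the helper names (`box_mask`, `layout_energy`, `forward_guidance`), the squared-deficit energy, and the blending rule are hypothetical and are not the authors' implementation.

```python
# Toy sketch only. The energy and the blending rule are assumptions made for
# illustration; they are not formulas stated in the abstract.

def box_mask(height, width, box):
    """Build a height x width 0/1 mask for box = (top, left, bottom, right), ends exclusive."""
    top, left, bottom, right = box
    return [[1.0 if top <= r < bottom and left <= c < right else 0.0
             for c in range(width)]
            for r in range(height)]

def attention_in_box(attn, mask):
    """Fraction of one token's attention mass that falls inside the mask."""
    total = sum(sum(row) for row in attn)
    inside = sum(a * m
                 for attn_row, mask_row in zip(attn, mask)
                 for a, m in zip(attn_row, mask_row))
    return inside / total

def layout_energy(attn, mask):
    """Hypothetical energy: 0 if all attention lies inside the box, 1 if none does."""
    return (1.0 - attention_in_box(attn, mask)) ** 2

def forward_guidance(attn, mask, strength=1.0):
    """Hypothetical forward-style guidance: blend the map toward the box mask.

    The target places the map's total mass uniformly inside the box, so the
    blended map keeps the same total attention mass as the input.
    """
    total = sum(sum(row) for row in attn)
    mask_total = sum(sum(row) for row in mask)
    return [[(1.0 - strength) * a + strength * m * total / mask_total
             for a, m in zip(attn_row, mask_row)]
            for attn_row, mask_row in zip(attn, mask)]

# Usage: a uniform 4x4 attention map and a box covering the top-left quadrant.
uniform = [[1.0 / 16] * 4 for _ in range(4)]
mask = box_mask(4, 4, (0, 0, 2, 2))
energy_before = layout_energy(uniform, mask)
guided = forward_guidance(uniform, mask, strength=1.0)
energy_after = layout_energy(guided, mask)
```

A backward-style variant would presumably treat an energy like this as a loss and update the latent by gradient descent rather than overwrite the attention map. That requires an automatic-differentiation framework and the actual generator, so it is left out of this sketch.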
URL
https://arxiv.org/abs/2304.03373