Abstract
Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes by learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, and cloning, as well as layer-wise appearance editing operations such as object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responds in less than a second.
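The abstract only describes the method at a high level. As a rough illustration of the stated key insight (jointly denoising renderings of one layered scene under different spatial layouts), the following is a minimal NumPy sketch under our own assumptions. All names (predict_noise, render_scene, joint_denoise_step), the shift-based layout sampling, and the averaged update are hypothetical stand-ins for the paper's actual sampling procedure; predict_noise is a stub for a real text-to-image diffusion denoiser.

```python
import numpy as np


def predict_noise(latent, t):
    """Placeholder for a pretrained diffusion denoiser eps_theta(x_t, t).

    Returns zeros so the sketch runs standalone; a real implementation would
    call a text-to-image diffusion UNet here instead.
    """
    return np.zeros_like(latent)


def render_scene(layers, offsets, shape):
    """Composite per-layer latents front-to-back at the given 2D offsets."""
    canvas = np.zeros(shape)
    occupied = np.zeros(shape, dtype=bool)
    for (latent, mask), (dy, dx) in zip(layers, offsets):
        shifted_mask = np.roll(mask, (dy, dx), axis=(0, 1))
        shifted_latent = np.roll(latent, (dy, dx), axis=(0, 1))
        visible = shifted_mask & ~occupied  # earlier layers occlude later ones
        canvas[visible] = shifted_latent[visible]
        occupied |= visible
    return canvas


def joint_denoise_step(layers, t, shape, n_layouts=4, step=0.1, rng=None):
    """One hypothetical sampling step: denoise renderings of the same layered
    scene under several randomly shifted layouts and average the per-layer
    updates, which is what couples each layer's appearance across layouts."""
    rng = rng or np.random.default_rng(0)
    updates = [np.zeros_like(latent) for latent, _ in layers]
    for _ in range(n_layouts):
        offsets = [tuple(rng.integers(-8, 9, size=2)) for _ in layers]
        rendering = render_scene(layers, offsets, shape)
        eps = predict_noise(rendering, t)
        for i, ((_, mask), (dy, dx)) in enumerate(zip(layers, offsets)):
            # Un-shift the noise estimate back into each layer's own frame.
            updates[i] += np.roll(eps, (-dy, -dx), axis=(0, 1)) * mask
    return [(latent - step * upd / n_layouts, mask)
            for (latent, mask), upd in zip(layers, updates)]


# Toy usage on a 64x64 single-channel latent grid (illustrative only).
H, W = 64, 64
fg_mask = np.zeros((H, W), dtype=bool)
fg_mask[16:32, 16:32] = True
bg_mask = np.ones((H, W), dtype=bool)
layers = [(np.random.randn(H, W), fg_mask), (np.random.randn(H, W), bg_mask)]
layers = joint_denoise_step(layers, t=500, shape=(H, W))
```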
URL
https://arxiv.org/abs/2404.07178