Abstract
Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences -- locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200, and SPair-71k datasets (by 20.9% relative on SPair-71k).
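The core idea above -- optimizing a prompt embedding so that attention concentrates on a region of interest, then reusing that embedding on a second image -- can be sketched as a toy gradient-ascent loop. This is a minimal illustration, not the paper's method: random features stand in for the frozen diffusion model's cross-attention features, and all names (`image_feats`, `region_mask`, etc.) are hypothetical.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: in the paper, features come from a frozen text-to-image
# diffusion model's cross-attention layers; random features suffice here.
num_patches, dim = 64, 32
image_feats = torch.randn(num_patches, dim)   # source-image features
region_mask = torch.zeros(num_patches)
region_mask[:8] = 1.0                         # the "region of interest"

# Learnable prompt embedding, optimized so attention lands on the region.
embedding = torch.randn(dim, requires_grad=True)
opt = torch.optim.Adam([embedding], lr=0.1)

def attention(feats, emb):
    # Softmax over patches of the scaled feature/embedding similarity.
    return torch.softmax(feats @ emb / dim**0.5, dim=0)

for _ in range(200):
    attn = attention(image_feats, embedding)
    loss = -(attn * region_mask).sum()  # maximize attention mass on region
    opt.zero_grad()
    loss.backward()
    opt.step()

# Transfer: apply the optimized embedding to a second image's features;
# the argmax patch is the predicted semantic correspondence.
target_feats = torch.randn(num_patches, dim)
corr_patch = attention(target_feats, embedding.detach()).argmax().item()
print(corr_patch)
```

After optimization, nearly all attention mass on the source image falls inside the masked region, and the same embedding selects the most semantically similar patch in the target image.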
URL
https://arxiv.org/abs/2305.15581