Abstract
Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task, as it allows users to control both the content and the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches, such as GANs, are more efficient, as they only need a single feed-forward pass for generation, but their image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling that uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
URL
https://arxiv.org/abs/2312.13314