Paper Reading AI Learner

Unlocking Pre-trained Image Backbones for Semantic Image Synthesis

2023-12-20 09:39:19
Tariq Berrada, Jakob Verbeek, Camille Couprie, Karteek Alahari

Abstract

Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task, as it allows control of both the content and the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches, such as GANs, are more efficient, as they need only a single feed-forward pass for generation, but image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling that uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
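The core idea of building a discriminator on top of a pre-trained feature backbone can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the backbone here is a random stand-in CNN (in practice one would plug in an ImageNet-pre-trained network), and the per-class output head and freezing strategy are simplifying assumptions. The head predicts per-pixel logits over the semantic classes plus one "fake" channel, so real/fake decisions are conditioned on the label map.

```python
import torch
import torch.nn as nn

class BackboneDiscriminator(nn.Module):
    """Sketch of a semantic-image-synthesis discriminator that reuses a
    frozen, pre-trained feature backbone (hypothetical simplification)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        # Keep the pre-trained features frozen; only the head is trained.
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        # One logit per semantic class plus one "fake" channel, per pixel.
        self.head = nn.Conv2d(feat_dim, num_classes + 1, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(image)
        return self.head(feats)  # (B, num_classes + 1, H', W') logits

# Stand-in "backbone" with random weights, for illustration only.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
disc = BackboneDiscriminator(backbone, feat_dim=64, num_classes=150)
logits = disc(torch.randn(2, 3, 64, 64))
print(tuple(logits.shape))  # (2, 151, 32, 32)
```

A real discriminator of this kind would tap multi-scale features from the pre-trained network and aggregate them before the prediction head; the single-stage version above only shows the frozen-backbone-plus-trainable-head pattern.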

Abstract (translated)

Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task because it allows control of both the content and the spatial layout of the generated images. Although diffusion models have achieved state-of-the-art results in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches, such as GANs, are more efficient because they need only a single feed-forward pass for generation, but image quality tends to degrade on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbones from networks pre-trained on tasks such as image classification. We also introduce a new generator architecture with better context modeling that uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we call DP-SIMS, achieves state-of-the-art results in image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.

URL

https://arxiv.org/abs/2312.13314

PDF

https://arxiv.org/pdf/2312.13314.pdf

