
Scene Graph to Image Synthesis: Integrating CLIP Guidance with Graph Conditioning in Diffusion Models

2024-01-25 11:46:31
Rameshwar Mishra, A V Subramanyam

Abstract

Advances in generative models have sparked significant interest in generating images that adhere to specific structural guidelines. Scene graph to image generation is one such task: generating images that are consistent with a given scene graph. However, the complexity of visual scenes makes it challenging to align objects accurately according to the relations specified in the scene graph. Existing methods approach this task by first predicting a scene layout and then generating images from that layout using adversarial training. In this work, we introduce a novel approach to generating images from scene graphs that eliminates the need to predict intermediate layouts. We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images. To this end, we first pre-train our graph encoder to align graph features with the CLIP features of the corresponding images using GAN-based training. We then fuse the graph features with the CLIP embeddings of the object labels present in the given scene graph to create a graph-consistent, CLIP-guided conditioning signal. In this conditioning input, the object embeddings provide the coarse structure of the image, and the graph features provide structural alignment based on the relationships among objects. Finally, we fine-tune a pre-trained diffusion model on the graph-consistent conditioning signal using reconstruction and CLIP alignment losses. Extensive experiments show that our method outperforms existing methods on the standard COCO-Stuff and Visual Genome benchmarks.
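
As a rough illustration of the pipeline the abstract describes, the sketch below shows one way a conditioning signal could be assembled from object-label embeddings and a graph feature, and how a reconstruction loss and a CLIP alignment loss could be combined. The abstract does not specify the fusion operator, the loss definitions, or the loss weighting, so everything here is an assumption: the function names (build_conditioning, clip_alignment_loss, reconstruction_loss, total_loss), the choice of appending the graph feature as an extra token, the cosine-based alignment term, and the weight parameter are all hypothetical. Plain Python lists stand in for real CLIP and graph-encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_conditioning(object_embeddings, graph_feature):
    """Assumed fusion: keep one token per object label and append the
    graph feature as an additional token. The paper's actual fusion
    may differ."""
    return [list(e) for e in object_embeddings] + [list(graph_feature)]


def clip_alignment_loss(graph_feature, image_feature):
    """Assumed alignment term: 1 - cosine similarity between the graph
    feature and the image's CLIP feature (0 when perfectly aligned)."""
    return 1.0 - cosine_similarity(graph_feature, image_feature)


def reconstruction_loss(predicted_noise, true_noise):
    """Mean squared error, the usual diffusion denoising objective."""
    n = len(true_noise)
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / n


def total_loss(predicted_noise, true_noise, graph_feature, image_feature, weight=1.0):
    """Reconstruction loss plus a weighted alignment term.
    The weight is a hypothetical hyperparameter."""
    return (reconstruction_loss(predicted_noise, true_noise)
            + weight * clip_alignment_loss(graph_feature, image_feature))


# Toy usage: two object-label embeddings and one graph feature.
conditioning = build_conditioning([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
loss = total_loss([1.0, 2.0], [1.0, 4.0], [1.0, 0.0], [0.0, 1.0], weight=0.5)
```

In a real system the conditioning tokens would be passed to the diffusion model's cross-attention layers in place of text-encoder outputs; that wiring is omitted here because the abstract gives no detail on it.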

URL

https://arxiv.org/abs/2401.14111

PDF

https://arxiv.org/pdf/2401.14111.pdf

