Scene Graph to Image Synthesis: Integrating CLIP Guidance with Graph Conditioning in Diffusion Models

Abstract

Advances in generative models have sparked significant interest in generating images that adhere to specific structural guidelines. Scene graph to image generation is one such task: generating images that are consistent with a given scene graph. However, the complexity of visual scenes makes it challenging to align objects accurately according to the relations specified in the scene graph. Existing methods approach this task by first predicting a scene layout and then generating images from that layout using adversarial training. In this work, we introduce a novel approach to generating images from scene graphs that eliminates the need to predict intermediate layouts. We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images. To this end, we first pre-train our graph encoder to align graph features with the CLIP features of the corresponding images using GAN-based training. We then fuse the graph features with the CLIP embeddings of the object labels present in the given scene graph to create a graph-consistent, CLIP-guided conditioning signal. In the conditioning input, the object embeddings provide the coarse structure of the image, and the graph features provide structural alignment based on the relationships among objects. Finally, we fine-tune a pre-trained diffusion model on the graph-consistent conditioning signal with reconstruction and CLIP alignment losses. Extensive experiments show that our method outperforms existing methods on the standard COCO-Stuff and Visual Genome benchmarks.
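
The conditioning and fine-tuning objective described in the abstract can be illustrated with a small sketch. The abstract does not specify how the features are fused or the exact form of the losses, so everything here is an assumption made for illustration: the function names (`build_conditioning`, `total_loss`), fusion by appending the graph feature as one extra token, mean squared error as the reconstruction term, cosine distance as the CLIP alignment term, and the `clip_weight` parameter. Plain Python lists stand in for real CLIP and graph-encoder outputs.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_conditioning(object_label_embeddings, graph_feature):
    """Fuse object-label embeddings with the graph feature.

    Assumption: fusion is done by appending the graph feature as one
    extra token after the per-object embeddings. The abstract only says
    the two are fused; it does not say how.
    """
    return [list(e) for e in object_label_embeddings] + [list(graph_feature)]


def total_loss(pred_noise, true_noise, graph_feature, image_clip_feature,
               clip_weight=1.0):
    """Reconstruction loss plus a weighted CLIP alignment loss.

    Assumptions: reconstruction is the MSE between predicted and true
    noise; alignment is the cosine distance between the graph feature
    and the image's CLIP feature; clip_weight is a hypothetical
    balancing hyperparameter.
    """
    reconstruction = mse(pred_noise, true_noise)
    alignment = 1.0 - cosine_similarity(graph_feature, image_clip_feature)
    return reconstruction + clip_weight * alignment


# Toy usage with 2-dimensional stand-in vectors.
conditioning = build_conditioning([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
loss = total_loss([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0],
                  clip_weight=0.5)
```

In this toy form the object embeddings and the graph feature share one token sequence, which mirrors the abstract's description of object embeddings carrying coarse content and graph features carrying relational structure. A real implementation would operate on tensors from a CLIP model and a trained graph encoder.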

URL

https://arxiv.org/abs/2401.14111

PDF

https://arxiv.org/pdf/2401.14111.pdf
