Paper Reading AI Learner

Relation Rectification in Diffusion Model

2024-03-29 15:54:36
Yinwei Wu, Xingyi Yang, Xinchao Wang

Abstract

Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often struggle to accurately depict visual relationships between objects. As we uncover through careful analysis, this issue arises from a misaligned text encoder that struggles to interpret specific relationships and to differentiate the logical order of the associated objects. To resolve this, we introduce a novel task termed Relation Rectification, which aims to refine the model so that it accurately represents a given relationship it initially fails to generate. We address this task with a Heterogeneous Graph Convolutional Network (HGCN) that models the directional relationships between relation terms and the corresponding objects in the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings produced by the text encoder, ensuring that the textual relation is accurately reflected in the embedding space. Crucially, our method keeps the parameters of the text encoder and diffusion model frozen, preserving the model's robust performance on unrelated descriptions. We validate our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative improvements in generating images with precise visual relations. Project page: this https URL.
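The paper itself does not publish its architecture in this abstract, but the core idea can be sketched: a small graph network with separate weights per directed edge type (subject→relation, relation→object) produces residual adjustments to the frozen text encoder's embeddings, so that swapping the two objects around the same relation word yields different adjusted embeddings. The sketch below is a hypothetical minimal illustration in PyTorch; all class and parameter names are assumptions, the nonlinearity is omitted for brevity, and the real HGCN is certainly more elaborate.

```python
import torch
import torch.nn as nn


class RelationHGCNSketch(nn.Module):
    """Hypothetical sketch of a lightweight heterogeneous GCN layer.

    Nodes are the (subject, relation, object) token embeddings from the
    frozen text encoder. Each directed edge type gets its own weight
    matrix, which is what makes the graph "heterogeneous" and the
    adjustment sensitive to object order.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w_self = nn.Linear(dim, dim, bias=False)          # self-loop
        self.w_subj_to_rel = nn.Linear(dim, dim, bias=False)   # subject -> relation
        self.w_rel_to_obj = nn.Linear(dim, dim, bias=False)    # relation -> object

    def forward(self, subj, rel, obj):
        # Message passing along the directed edges; the outputs are
        # residual adjustments added to the frozen embeddings.
        rel_adj = self.w_self(rel) + self.w_subj_to_rel(subj)
        obj_adj = self.w_self(obj) + self.w_rel_to_obj(rel_adj)
        return subj, rel + rel_adj, obj + obj_adj


torch.manual_seed(0)
dim = 8
hgcn = RelationHGCNSketch(dim)

# Stand-in embeddings for "cat", "on top of", "box".
cat, on_top_of, box = torch.randn(3, dim).unbind(0)

# Same relation word, reversed object order -> different adjusted
# relation embeddings, letting the diffusion model's conditioning
# distinguish "cat on box" from "box on cat".
_, rel_a, _ = hgcn(cat, on_top_of, box)
_, rel_b, _ = hgcn(box, on_top_of, cat)
print(torch.allclose(rel_a, rel_b))  # → False: the adjustment is direction-sensitive
```

In training, only the HGCN's few weight matrices would be optimized (against the pair of reversed prompts and the reference images), while the text encoder and diffusion U-Net stay frozen, which is why the method leaves performance on unrelated prompts intact.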

URL

https://arxiv.org/abs/2403.20249

PDF

https://arxiv.org/pdf/2403.20249.pdf

