Obtaining Favorable Layouts for Multiple Object Generation

2024-05-01 18:07:48
Barak Battash, Amit Rozner, Lior Wolf, Ofir Lindenbaum

Abstract

Large-scale text-to-image models that generate high-quality, diverse images from textual prompts have shown remarkable success. These models ultimately aim to create complex scenes, and addressing the challenge of multi-subject generation is a critical step towards this goal. However, existing state-of-the-art diffusion models struggle when generating images that involve multiple subjects: presented with a prompt containing more than one subject, they may omit some subjects or merge them together. To address this challenge, we propose a novel approach based on a guiding principle: we let the diffusion model initially propose a layout, and we then rearrange the layout grid. This is achieved by constraining cross-attention maps (XAMs) to adhere to proposed masks and by migrating latent-map pixels to new locations that we determine. We introduce new loss terms aimed at reducing XAM entropy for a clearer spatial definition of subjects, reducing the overlap between XAMs, and ensuring that XAMs align with their respective masks. We compare our approach with several alternative methods and show that it more faithfully captures the desired concepts across a variety of text prompts.
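
The three loss terms described in the abstract admit a compact sketch. The snippet below is a minimal illustration, not the paper's implementation: the function name `xam_losses`, the tensor shapes, and the concrete formulations (Shannon entropy of each normalized map, pairwise inner-product overlap, and out-of-mask attention mass) are assumptions made for illustration; the paper defines its own loss terms and weightings.

```python
import torch

def xam_losses(xams, masks, eps=1e-8):
    """Hypothetical sketch of the three loss terms on cross-attention maps.

    xams:  (S, H, W) cross-attention maps, one per subject token.
    masks: (S, H, W) binary target masks proposed for each subject.
    """
    S = xams.shape[0]
    # Normalize each subject's map into a spatial probability distribution.
    p = xams.reshape(S, -1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)

    # 1) Entropy term: low entropy encourages each subject to occupy a
    #    well-localized region instead of being spread across the image.
    entropy = -(p * (p + eps).log()).sum(dim=1).mean()

    # 2) Overlap term: penalize spatial overlap between every pair of
    #    subject maps so that subjects do not merge.
    overlap = torch.zeros(())
    for i in range(S):
        for j in range(i + 1, S):
            overlap = overlap + (p[i] * p[j]).sum()

    # 3) Mask-alignment term: penalize attention mass that falls outside
    #    a subject's proposed mask.
    m = masks.reshape(S, -1).float()
    outside = (p * (1.0 - m)).sum(dim=1).mean()

    return entropy, overlap, outside

if __name__ == "__main__":
    # Toy usage: two subjects on a 16x16 attention grid.
    xams = torch.rand(2, 16, 16)
    masks = torch.zeros(2, 16, 16)
    masks[0, :, :8] = 1  # subject 0 assigned to the left half
    masks[1, :, 8:] = 1  # subject 1 assigned to the right half
    ent, ov, out = xam_losses(xams, masks)
    loss = ent + ov + out  # equal weighting is an assumption
```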

URL

https://arxiv.org/abs/2405.00791

PDF

https://arxiv.org/pdf/2405.00791.pdf

