Control and Realism: Best of Both Worlds in Layout-to-Image without Training

2025-06-18 15:39:02
Bonan Li, Yinhan Hu, Songhua Liu, Xinchao Wang

Abstract

Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often suffer from imprecise localization and unrealistic artifacts. To address these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that jointly enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, which prevent objects from aligning uniformly with layout instructions. To overcome this issue, a non-local attention prior is explored to redistribute attention scores, helping objects better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
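
The abstract gives no formulas, so the following is only an illustrative sketch of the two ingredients it names, not WinWinLay itself. It assumes one common form of the box-constrained attention energy, E = (1 - attention mass inside the box / total attention mass)^2, and a generic Langevin-style update (a gradient step plus Gaussian noise). The softmax "attention map", the function names, the step size, and the 8x8 grid are all hypothetical stand-ins for a real diffusion model; the paper's non-local prior and adaptive schedule are not reproduced here.

```python
import numpy as np


def attention_energy(attn, box_mask):
    """Assumed energy: (1 - attention mass inside the box / total mass)^2."""
    inside = float((attn * box_mask).sum())
    total = float(attn.sum())
    return (1.0 - inside / total) ** 2


def softmax_map(z):
    """Toy stand-in for a cross-attention map: a softmax over all latent cells."""
    e = np.exp(z - z.max())
    return e / e.sum()


def energy_grad(z, box_mask):
    """Analytic gradient of attention_energy(softmax_map(z), box_mask) w.r.t. z."""
    a = softmax_map(z)
    p = float((a * box_mask).sum())  # attention mass inside the box
    return -2.0 * (1.0 - p) * a * (box_mask - p)


def langevin_step(z, grad, step, rng, noise_scale=1.0):
    """One Langevin-style update: gradient descent on the energy plus Gaussian noise."""
    noise = rng.standard_normal(z.shape)
    return z - step * grad + noise_scale * np.sqrt(2.0 * step) * noise


def guide(z, box_mask, steps=100, step=10.0, noise_scale=0.0, seed=0):
    """Repeat the update; noise_scale=0.0 reduces it to plain gradient descent."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        z = langevin_step(z, energy_grad(z, box_mask), step, rng, noise_scale)
    return z


box = np.zeros((8, 8))
box[:4, :4] = 1.0            # hypothetical layout box: top-left quadrant
z0 = np.zeros((8, 8))        # uniform toy "latent"
e_before = attention_energy(softmax_map(z0), box)
e_after = attention_energy(softmax_map(guide(z0, box)), box)
print(e_before, e_after)     # energy drops as attention mass moves into the box
```

With `noise_scale=0.0` the loop is the plain gradient update the abstract calls the vanilla rule; a positive `noise_scale` adds the stochastic term that characterizes Langevin dynamics. How the noise and step size should be adapted in practice is the paper's contribution and is left open in this sketch.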

URL

https://arxiv.org/abs/2506.15563

PDF

https://arxiv.org/pdf/2506.15563.pdf

