Abstract
Recently, integrating visual controls into text-to-image~(T2I) models, e.g., via the ControlNet method, has received significant attention for its finer control capabilities. While various training-free methods strive to enhance prompt following in T2I models, the problem remains rarely studied under visual control, especially when the visual controls are misaligned with the text prompts. In this paper, we address the challenge of ``Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish the aligned and misaligned parts of the visual controls and prompts. Meanwhile, a network, dubbed Masked ControlNet, is designed to exploit these object masks for object generation in the misaligned visual control regions. Furthermore, to improve attribute matching, a simple yet effective loss is designed to align the attention maps of attributes with the object regions constrained by ControlNet and the object masks. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
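The abstract does not spell out the attribute-matching loss, but a common way to realize "align the attention maps of attributes with object regions" is to penalize the fraction of an attribute token's cross-attention mass that falls outside its object mask. The sketch below is a hypothetical, minimal version of such a loss (the function name, shapes, and formulation are assumptions for illustration, not the paper's exact definition):

```python
import numpy as np

def attribute_mask_loss(attn, mask, eps=1e-8):
    """Hypothetical attention-alignment loss sketch.

    attn: (H, W) non-negative cross-attention map for one attribute token.
    mask: (H, W) binary object mask (1 inside the object region).
    Returns a value in [0, 1]: 0 when all attention mass lies inside the
    mask, 1 when none of it does.
    """
    attn = attn / (attn.sum() + eps)   # normalize map to a distribution
    inside = (attn * mask).sum()       # attention mass on the object region
    return 1.0 - inside                # penalize mass leaking outside the mask
```

In a diffusion pipeline this quantity would typically be minimized by gradient steps on the latent during sampling, so that each attribute's attention concentrates in the region given by ControlNet and the object mask.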
Abstract (translated)
Recently, integrating visual controls into text-to-image (T2I) models, such as the ControlNet method, has received wide attention for its more precise control capabilities. Although various training-free methods strive to enhance prompt following in T2I models, the problem under visual control remains rarely studied, especially when the visual controls are misaligned with the text prompts. In this paper, we address the challenge of "Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish the aligned and misaligned parts of the visual controls and prompts. Meanwhile, a network named Masked ControlNet is designed to use these object masks for object generation in the misaligned visual control regions. Furthermore, to improve attribute matching, a simple yet effective loss is designed to align the attention maps of attributes with the object regions constrained by ControlNet and the object masks. The effectiveness and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
URL
https://arxiv.org/abs/2404.14768