Abstract
To enhance the controllability of text-to-image diffusion models, existing efforts such as ControlNet incorporate image-based conditional controls. In this paper, we show that existing methods still struggle to generate images that align with those image conditional controls. To address this, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition from the generated image, and then optimize the consistency loss between the input conditional control and the extracted condition. A straightforward implementation would generate images from random noise and then compute the consistency loss, but this requires storing gradients for multiple sampling timesteps, which incurs considerable time and memory costs. Instead, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling and makes reward fine-tuning more efficient. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it improves over ControlNet by 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE for segmentation mask, line-art edge, and depth conditions, respectively.
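The efficient reward strategy described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the DDPM-style noising formula, the mean-squared-error consistency loss, and the helper names (`add_noise`, `single_step_denoise`, `toy_reward_model`) are assumptions made for illustration, and the "reward model" here is a trivial edge extractor standing in for a pre-trained discriminative network. The sketch only shows the forward pass: an input image is noised, the clean image is estimated in a single step from a noise prediction, and the condition extracted from that estimate is compared with the input condition.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar_t):
    """Disturb a clean image: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps (assumed DDPM form)."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def single_step_denoise(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean image in one step from x_t and a predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def toy_reward_model(image):
    """Hypothetical stand-in for the pre-trained discriminative model:
    returns a crude 'edge map' (absolute horizontal differences)."""
    return np.abs(np.diff(image, axis=-1))

def consistency_loss(condition_in, condition_out):
    """Pixel-level consistency, here assumed to be mean squared error."""
    return float(np.mean((condition_in - condition_out) ** 2))

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))              # toy "training image"
condition = toy_reward_model(x0)     # input conditional control
eps = rng.standard_normal(x0.shape)  # noise used to disturb the image
alpha_bar_t = 0.7                    # assumed cumulative noise-schedule value

x_t = add_noise(x0, eps, alpha_bar_t)

# A perfect noise prediction recovers x0, so the extracted condition matches.
x0_hat_perfect = single_step_denoise(x_t, eps, alpha_bar_t)
loss_perfect = consistency_loss(condition, toy_reward_model(x0_hat_perfect))

# An imperfect prediction (simulating a model error) yields a positive loss,
# which is the signal reward fine-tuning would minimize.
eps_pred = eps + 0.1 * rng.standard_normal(x0.shape)
x0_hat = single_step_denoise(x_t, eps_pred, alpha_bar_t)
loss_imperfect = consistency_loss(condition, toy_reward_model(x0_hat))

print(loss_perfect, loss_imperfect)
```

In an actual fine-tuning setup the noise prediction would come from the diffusion model, and the loss would be backpropagated through the single denoising step only, which is what avoids storing gradients across many sampling timesteps.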
URL
https://arxiv.org/abs/2404.07987