ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

2024-04-11 17:59:09
Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, Chen Chen

Abstract

To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporate image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition from the generated images, and then optimize the consistency loss between the input conditional control and the extracted condition. A straightforward implementation would be to generate images from random noise and then calculate the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling and allows for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet of 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE for segmentation mask, line-art edge, and depth conditions, respectively.
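
The abstract describes the training procedure concretely enough to sketch. Below is a minimal PyTorch-style sketch of the efficient reward strategy under stated assumptions: the tiny stand-in networks, the MSE consistency loss, and the placeholder noise schedule are illustrative only, not the authors' implementation (the paper pairs a Stable Diffusion UNet with ControlNet and task-specific frozen reward models, e.g. a segmentation network for mask conditions).

```python
import torch
import torch.nn.functional as F

# --- Hypothetical stand-ins. The real models are a Stable Diffusion UNet
# with a ControlNet branch, and a frozen discriminative reward model
# (e.g., a segmentation network for mask conditions). ---
class TinyControlledUNet(torch.nn.Module):
    """Predicts the noise eps from a noisy image and its condition.
    (A real UNet would also embed the timestep t.)"""
    def __init__(self, channels=3):
        super().__init__()
        self.net = torch.nn.Conv2d(channels * 2, channels, 3, padding=1)

    def forward(self, x_t, t, condition):
        return self.net(torch.cat([x_t, condition], dim=1))

class TinyRewardModel(torch.nn.Module):
    """Extracts the condition (e.g., a segmentation map) from an image."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, image):
        return self.net(image)

def reward_finetune_step(unet, reward_model, optimizer, x0, condition,
                         alphas_cumprod, t_max=200):
    """One efficient reward step: noise the real image, denoise in a single
    step, and penalize mismatch between input and extracted conditions."""
    b = x0.shape[0]
    # Sample small timesteps so a single denoising step stays accurate.
    t = torch.randint(0, t_max, (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    # Forward diffusion: deliberately disturb the input image with noise.
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # Single-step denoising: recover an estimate of x0 from predicted noise.
    eps_hat = unet(x_t, t, condition)
    x0_hat = (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
    # Cycle consistency: the condition extracted from the one-step
    # reconstruction should match the input conditional control.
    loss = F.mse_loss(reward_model(x0_hat), condition)
    optimizer.zero_grad()
    loss.backward()  # gradients flow through a single UNet call only
    optimizer.step()
    return loss.item()

# Toy usage with random tensors and a placeholder noise schedule.
unet = TinyControlledUNet()
reward = TinyRewardModel().eval()
for p in reward.parameters():
    p.requires_grad_(False)  # the reward model stays frozen
opt = torch.optim.AdamW(unet.parameters(), lr=1e-5)
alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
x0, cond = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
print(reward_finetune_step(unet, reward, opt, x0, cond, alphas_cumprod))
```

The key efficiency point the sketch illustrates: because the image is reconstructed in one denoising step from a noised real image, only one UNet activation graph is kept in memory for the backward pass, instead of one per sampling timestep as in full-trajectory reward fine-tuning.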

URL

https://arxiv.org/abs/2404.07987

PDF

https://arxiv.org/pdf/2404.07987.pdf

