Paper Reading AI Learner

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

2023-05-25 17:59:58
Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong

Abstract

Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images from open-domain text descriptions. However, despite their success, text descriptions often struggle to convey detailed controls adequately, even when composed of long and complex text. Moreover, recent studies have shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need for control modes beyond text descriptions. In this paper, we introduce Uni-ControlNet, a novel approach that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one model. Unlike existing methods, Uni-ControlNet requires only the fine-tuning of two additional adapters on top of frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to dedicated adapter designs, Uni-ControlNet necessitates only a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning cost and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability. Code is available at \url{this https URL}.
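The key claim above — one adapter for all local controls and one for all global controls, attached to a frozen backbone — can be illustrated with a toy NumPy sketch. This is not the paper's implementation; the projection shapes, fusion rule, and stand-in denoiser are all hypothetical, meant only to show why the adapter count stays at two no matter how many conditions are supplied.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimension

def local_adapter(condition_maps):
    # One adapter fuses ALL local conditions (edge, depth, seg, ...):
    # concatenate channel-wise, then project to a single injected feature.
    # The hypothetical 1x1-conv projection handles any number of inputs.
    stacked = np.concatenate(condition_maps, axis=0)      # (C_total, H, W)
    proj = rng.standard_normal((D, stacked.shape[0]))
    return np.einsum('dc,chw->dhw', proj, stacked)        # (D, H, W)

def global_adapter(clip_embedding, num_tokens=4):
    # One adapter projects a global CLIP image embedding into extra
    # condition tokens appended to the caption tokens.
    proj = rng.standard_normal((num_tokens * D, clip_embedding.shape[0]))
    return (proj @ clip_embedding).reshape(num_tokens, D)

def frozen_unet(x, tokens, injected):
    # Stand-in for the frozen pre-trained denoiser: its own weights never
    # change; adapter outputs enter only as extra features / tokens.
    return x + injected.mean(axis=(1, 2)) + tokens.mean(axis=0)

# Two local controls + one global control, yet still exactly 2 adapters.
edge = rng.standard_normal((1, 16, 16))
depth = rng.standard_normal((1, 16, 16))
clip = rng.standard_normal(512)
caption_tokens = rng.standard_normal((3, D))

injected = local_adapter([edge, depth])
tokens = np.concatenate([caption_tokens, global_adapter(clip)], axis=0)
out = frozen_unet(rng.standard_normal(D), tokens, injected)
print(out.shape)  # (8,)
```

Adding a third local control (e.g., a pose map) only widens the concatenation inside `local_adapter`; no new adapter is introduced, which is the composability property the abstract emphasizes.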

URL

https://arxiv.org/abs/2305.16322

PDF

https://arxiv.org/pdf/2305.16322.pdf

