
MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

2023-03-23 09:30:39
Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wenjing Yang

Abstract

The advent of open-source AI communities has produced a cornucopia of powerful text-guided diffusion models trained on various datasets. However, few studies have explored ensembling such models to combine their strengths. In this work, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB) that enables fused text-guided diffusion models to achieve more controllable generation. Specifically, we find experimentally that the responses of classifier-free guidance are highly related to the saliency of the generated images. We therefore propose to trust each model in its area of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB is training-free and can be completed within a DDIM sampling process. Additionally, it automatically aligns the semantics of the two noise spaces without requiring additional annotations such as masks. Extensive experiments demonstrate the effectiveness of SNB in various applications. The project page is available at this https URL.
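The blending idea can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the saliency map is taken to be a spatial softmax over the magnitude of the classifier-free guidance response, the blend is a hard per-pixel choice of whichever model is more salient, and the names `cfg_noise`, `saliency`, `blend`, and the `scale` parameter are hypothetical. Noise predictions are plain lists of shape (channels, pixels) to keep the example dependency-free.

```python
import math

def cfg_noise(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: uncond + w * (cond - uncond)."""
    return [[u + w * (c - u) for u, c in zip(ch_u, ch_c)]
            for ch_u, ch_c in zip(eps_uncond, eps_cond)]

def saliency(eps_uncond, eps_cond, scale=1.0):
    """Assumed saliency: spatial softmax of the channel-averaged
    magnitude of the guidance response (cond - uncond)."""
    n_ch = len(eps_cond)
    n_pix = len(eps_cond[0])
    response = [
        sum(abs(eps_cond[ch][p] - eps_uncond[ch][p]) for ch in range(n_ch)) / n_ch
        for p in range(n_pix)
    ]
    peak = max(response)  # subtract the max for numerical stability
    exps = [math.exp(scale * (r - peak)) for r in response]
    total = sum(exps)
    return [e / total for e in exps]

def blend(eps_a, eps_b, sal_a, sal_b):
    """Assumed blend: per pixel, keep the noise of the more salient model."""
    mask = [1 if a >= b else 0 for a, b in zip(sal_a, sal_b)]
    blended = [[xa if m else xb for xa, xb, m in zip(ch_a, ch_b, mask)]
               for ch_a, ch_b in zip(eps_a, eps_b)]
    return blended, mask

# Toy example: 1 channel, 4 pixels. Model A responds at pixel 0, model B at pixel 3.
uncond = [[0.0, 0.0, 0.0, 0.0]]
cond_a = [[4.0, 0.0, 0.0, 0.0]]
cond_b = [[0.0, 0.0, 0.0, 4.0]]

sal_a = saliency(uncond, cond_a)
sal_b = saliency(uncond, cond_b)
eps_a = cfg_noise(uncond, cond_a, w=2.0)
eps_b = cfg_noise(uncond, cond_b, w=2.0)
blended, mask = blend(eps_a, eps_b, sal_a, sal_b)
```

In an actual sampler, a step like this would presumably run once per DDIM iteration, with the blended noise replacing the single-model prediction; real noise tensors would be array-valued rather than nested lists.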

URL

https://arxiv.org/abs/2303.13126

PDF

https://arxiv.org/pdf/2303.13126.pdf

