MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

Abstract

The advent of open-source AI communities has produced a cornucopia of powerful text-guided diffusion models that are trained on various datasets. However, few explorations have been conducted on ensembling such models to combine their strengths. In this work, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB) that can empower the fused text-guided diffusion models to achieve more controllable generation. Specifically, we experimentally find that the responses of classifier-free guidance are highly related to the saliency of generated images. Thus we propose to trust different models in their areas of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB is training-free and can be completed within a DDIM sampling process. Additionally, it can automatically align the semantics of two noise spaces without requiring additional annotations such as masks. Extensive experiments show the impressive effectiveness of SNB in various applications. The project page is available at this https URL.
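The blending idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the softmax normalization, the `bias` parameter, and the hard per-pixel mask are all assumptions made for illustration. It treats the magnitude of the classifier-free guidance response (conditional minus unconditional noise prediction) as a per-pixel saliency map for each model, then takes each pixel's noise from whichever model is more salient there. In a real pipeline the arrays would come from two diffusion models at the same DDIM step; here they are random placeholders.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance combination of two noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def saliency_from_cfg(eps_uncond, eps_cond):
    """Hypothetical saliency map: channel-averaged |cond - uncond|, softmax-normalized.

    Inputs have shape (C, H, W); the result has shape (H, W) and sums to 1.
    """
    response = np.abs(eps_cond - eps_uncond).mean(axis=0)
    e = np.exp(response - response.max())  # subtract max for numerical stability
    return e / e.sum()

def saliency_aware_blend(eps_a, eps_b, sal_a, sal_b, bias=0.0):
    """Per pixel, keep model A's noise where its (biased) saliency wins, else model B's.

    `bias` is an assumed knob that shifts the decision toward model A.
    """
    mask = (sal_a + bias >= sal_b).astype(eps_a.dtype)  # (H, W), broadcasts over C
    blended = mask * eps_a + (1.0 - mask) * eps_b
    return blended, mask

# Placeholder predictions standing in for two models at one sampling step.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
unc_a, cond_a, unc_b, cond_b = (rng.standard_normal((C, H, W)) for _ in range(4))

eps_a = guided_noise(unc_a, cond_a, scale=7.5)
eps_b = guided_noise(unc_b, cond_b, scale=7.5)
blended, mask = saliency_aware_blend(
    eps_a, eps_b,
    saliency_from_cfg(unc_a, cond_a),
    saliency_from_cfg(unc_b, cond_b),
)
print("fraction of pixels taken from model A:", mask.mean())
```

Because the mask is derived from the noise predictions themselves, no external mask annotation is needed, which matches the training-free, annotation-free framing above. The blended noise would then be passed to the sampler's update step in place of a single model's prediction.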

Abstract (translated)

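The emergence of open-source AI communities has produced a large number of powerful text-guided diffusion models trained on a variety of datasets. However, only a few studies have looked at combining the strengths of these models. In this paper, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB), which enables fused text-guided diffusion models to achieve more controllable generation. Specifically, we find experimentally that the responses of classifier-free guidance are highly correlated with the saliency of the generated images. We therefore propose to trust each model in its own area of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB requires no training and can be completed within a DDIM sampling process. In addition, it automatically aligns the semantics of the two noise spaces without requiring additional annotations such as masks. Extensive experiments demonstrate the impressive effectiveness of SNB in a variety of applications. The project page is available at this link.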

URL

https://arxiv.org/abs/2303.13126

PDF

https://arxiv.org/pdf/2303.13126.pdf