Abstract
Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Vision-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preserving generation tasks, without requiring test-time optimization.
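The data-curation step described above can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: all function names, the stubbed grid generator, the VLM scoring function, and the 0.5 threshold are hypothetical placeholders standing in for the real diffusion model and Vision-Language Model.

```python
# Hypothetical sketch of the Diffusion Self-Distillation data pipeline.
# The real pipeline uses a text-to-image diffusion model and a VLM;
# both are stubbed out here so the control flow is runnable.

def generate_grid(prompt, n=4):
    """Stand-in for in-context grid generation: a text-to-image model
    produces n images intended to depict the same subject (stubbed)."""
    return [f"image_{i}<{prompt}>" for i in range(n)]

def vlm_identity_score(img_a, img_b):
    """Stand-in for a Vision-Language Model judging whether two grid
    cells show the same instance (stubbed: same prompt => match)."""
    return 1.0 if img_a.split("<")[1] == img_b.split("<")[1] else 0.0

def curate_pairs(prompts, threshold=0.5):
    """Build a paired dataset (reference image, target image, prompt),
    keeping only pairs the VLM judges identity-consistent."""
    dataset = []
    for prompt in prompts:
        grid = generate_grid(prompt)
        reference = grid[0]
        for target in grid[1:]:
            if vlm_identity_score(reference, target) >= threshold:
                dataset.append((reference, target, prompt))
    return dataset

pairs = curate_pairs(["a corgi surfing", "a ceramic robot"])
```

The curated `(reference, target, prompt)` triples would then be used to fine-tune the text-to-image model into a text+image-to-image model; that fine-tuning step is outside the scope of this sketch.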
URL
https://arxiv.org/abs/2411.18616