Paper Reading AI Learner

Diffusion Self-Distillation for Zero-Shot Customized Image Generation

2024-11-27 18:58:52
Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein

Abstract

Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Vision-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preserving generation tasks, without requiring test-time optimization.
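The data-curation pipeline the abstract describes can be sketched roughly as follows. Every function name and data structure here is a hypothetical stand-in (the real method uses a diffusion model for grid generation and a Vision-Language Model for filtering), so this only illustrates the flow from grid generation to paired training examples:

```python
# Hypothetical sketch of the Diffusion Self-Distillation data pipeline.
# All names below are illustrative assumptions, not the authors' code.

def generate_grid(prompt, n=4):
    # Stand-in for a text-to-image model's in-context grid generation:
    # one prompt yields n images intended to share the same subject identity.
    return [f"image({prompt})[{i}]" for i in range(n)]

def vlm_consistent(images):
    # Stand-in for the Vision-Language Model curation step, which keeps
    # only grids whose panels actually depict the same instance.
    return len(images) >= 2  # placeholder acceptance rule

def build_paired_dataset(prompts):
    # Turn each accepted grid into (reference image, target image, prompt)
    # triplets for fine-tuning a text+image-to-image model.
    pairs = []
    for prompt in prompts:
        grid = generate_grid(prompt)
        if vlm_consistent(grid):
            reference = grid[0]
            for target in grid[1:]:
                pairs.append((reference, target, prompt))
    return pairs

pairs = build_paired_dataset(["a corgi wearing a spacesuit"])
print(len(pairs))  # each accepted 4-image grid yields 3 paired examples
```

With this paired dataset in hand, the base text-to-image model is fine-tuned into a text+image-to-image model, which is what lets the method run zero-shot at test time with no per-instance optimization.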

URL

https://arxiv.org/abs/2411.18616

PDF

https://arxiv.org/pdf/2411.18616.pdf

