Paper Reading AI Learner

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

2024-01-30 05:56:12
Henglei Lv, Jiayu Xiao, Liang Li, Qingming Huang

Abstract

Diffusion-based text-to-image personalization has achieved great success in generating user-specified subjects across various contexts. Even so, existing finetuning-based methods still suffer from model overfitting, which greatly harms generative diversity, especially when the given subject images are few. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach that boosts identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. For the former, we construct an appearance palette from visual features of the reference image, from which we pick local patterns to generate the specified subject with a consistent identity. For layout drawing, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, inheriting its strong image prior to synthesize diverse contexts under different text conditions. The proposed approach can be applied to any personalized diffusion model and requires as few as a single reference image. Qualitative and quantitative experiments show that Pick-and-Draw consistently improves identity consistency and generative diversity, pushing the trade-off between subject fidelity and image-text fidelity to a new Pareto frontier.
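The two guidance components described above can be sketched in miniature. The snippet below is an illustrative approximation, not the paper's implementation: it assumes "local patterns" are per-location feature vectors (e.g. U-Net key/value features), models appearance picking as nearest-neighbor substitution against a normalized palette built from the reference image, and models layout drawing as thresholding a template's attention map into a subject mask. All function names and the numpy representation are assumptions for illustration.

```python
import numpy as np

def build_appearance_palette(ref_features):
    """Build an appearance palette from reference-image local features.

    ref_features: (N, D) array of local visual features, one row per
    spatial location. Rows are L2-normalized so cosine similarity
    reduces to a dot product.
    """
    norms = np.linalg.norm(ref_features, axis=1, keepdims=True)
    return ref_features / np.maximum(norms, 1e-8)

def pick_appearance(gen_features, palette):
    """Appearance picking: replace each generated local feature with its
    nearest pattern in the palette (cosine similarity), encouraging the
    generated subject to reuse the reference subject's local appearance.
    """
    norms = np.linalg.norm(gen_features, axis=1, keepdims=True)
    gen_norm = gen_features / np.maximum(norms, 1e-8)
    sims = gen_norm @ palette.T          # (M, N) cosine similarities
    idx = sims.argmax(axis=1)            # closest reference pattern per location
    return palette[idx], idx

def layout_mask(attn_map, quantile=0.8):
    """Layout drawing: outline the subject's contour from a generative
    template by keeping the top-activation region of its attention map.
    """
    thr = np.quantile(attn_map, quantile)
    return (attn_map >= thr).astype(np.float32)

# Toy usage: 4 reference patterns of dimension 8, 3 generated features.
rng = np.random.default_rng(0)
palette = build_appearance_palette(rng.normal(size=(4, 8)))
picked, idx = pick_appearance(rng.normal(size=(3, 8)), palette)
mask = layout_mask(np.arange(16).reshape(4, 4) / 15.0, quantile=0.75)
```

In the actual method these operations would act on diffusion features during sampling, steering generation without any finetuning; the sketch only conveys the pick-from-palette and mask-from-template ideas.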

Abstract (translated)

Diffusion-based text-to-image personalization has achieved great success in generating user-specified subjects across various contexts. Nevertheless, existing finetuning-based methods still suffer from model overfitting, which severely harms generative diversity, especially when subject images are scarce. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach that improves identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. For the former, we construct an appearance palette from the visual features of the reference image and pick local patterns from it to generate the specified subject with a consistent identity. For layout drawing, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit its strong image prior under different text conditions. The proposed method can be applied to any personalized diffusion model and requires only a single reference image. Qualitative and quantitative experiments demonstrate that Pick-and-Draw consistently improves identity consistency and generative diversity, pushing the trade-off between subject fidelity and image-text fidelity to a new Pareto frontier.

URL

https://arxiv.org/abs/2401.16762

PDF

https://arxiv.org/pdf/2401.16762.pdf
