Paper Reading AI Learner

Prompt-Free Diffusion: Taking 'Text' out of Text-to-Image Diffusion Models

2023-05-25 16:30:07
Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi

Abstract

Text-to-image (T2I) research has grown explosively in the past year, owing to large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: text prompt engineering, and searching for high-quality text prompts that yield customized results, remains more art than science. Moreover, as commonly argued, "an image is worth a thousand words": attempts to describe a desired image in text often end up ambiguous and cannot comprehensively cover delicate visual details, hence necessitating additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users. Our proposed framework, Prompt-Free Diffusion, relies on visual inputs alone to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scenes is the Semantic Context Encoder (SeeCoder), which substitutes for the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can pre-train a SeeCoder with one T2I model and reuse it in another. Through extensive experiments, Prompt-Free Diffusion is found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models that use prompts following best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models are open-sourced at this https URL.
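To make the architectural idea concrete, below is a minimal PyTorch sketch of the substitution the abstract describes: a visual "semantic context" encoder produces a token sequence shaped like text-encoder output, and the diffusion UNet's cross-attention consumes those tokens in place of CLIP text embeddings. This is not the authors' released code; the class names (`SeeCoderSketch`, `CrossAttnBlock`), layer sizes, and token count are illustrative assumptions only.

```python
# A minimal sketch (not the paper's implementation) of conditioning a
# T2I-style cross-attention block on visual context instead of text.
import torch
import torch.nn as nn


class SeeCoderSketch(nn.Module):
    """Hypothetical stand-in for SeeCoder: maps a reference image to a
    sequence of context embeddings shaped like text-encoder output."""

    def __init__(self, ctx_dim=768, n_tokens=77):
        super().__init__()
        # Tiny CNN backbone standing in for the paper's visual backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, ctx_dim, 4, stride=4),
        )
        self.n_tokens = n_tokens
        self.proj = nn.Linear(ctx_dim, ctx_dim)

    def forward(self, image):                       # image: (B, 3, H, W)
        feats = self.backbone(image)                # (B, C, h, w)
        tokens = feats.flatten(2).transpose(1, 2)   # (B, h*w, C)
        tokens = tokens[:, : self.n_tokens]         # fixed-length sequence
        return self.proj(tokens)                    # (B, n_tokens, C)


class CrossAttnBlock(nn.Module):
    """One UNet cross-attention block: queries come from latent features,
    keys/values from the context tokens (text tokens in a stock T2I model,
    visual tokens here)."""

    def __init__(self, dim=320, ctx_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim,
                                          vdim=ctx_dim, batch_first=True)

    def forward(self, latent_tokens, context):
        out, _ = self.attn(latent_tokens, context, context)
        return latent_tokens + out


# Usage: the reference image alone provides the conditioning; no prompt.
see = SeeCoderSketch()
block = CrossAttnBlock()
ref_image = torch.randn(1, 3, 256, 256)  # reference image as "context"
latents = torch.randn(1, 64, 320)        # flattened UNet latent tokens
ctx = see(ref_image)                     # (1, 77, 768), like CLIP text output
out = block(latents, ctx)                # cross-attend to visual context
print(out.shape)                         # torch.Size([1, 64, 320])
```

Because the visual tokens mimic the shape of text-encoder output, a pre-trained SeeCoder can in principle be dropped into any T2I backbone with the same cross-attention interface, which is the reusability property the abstract highlights.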


URL

https://arxiv.org/abs/2305.16223

PDF

https://arxiv.org/pdf/2305.16223.pdf

