InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

2024-04-06 14:56:59
Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, Di Huang

Abstract

Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, the imperative of achieving a seamless alignment between the generated image and the provided prompt persists as a formidable challenge. This paper traces the root of these difficulties to invalid initial noise, and proposes a solution in the form of Initial Noise Optimization (InitNO), a paradigm that refines this noise. Considering text prompts, not all random noises are effective in synthesizing semantically-faithful images. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise, bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at this https URL.
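
The idea in the abstract (score the initial noise, treat low-scoring noise as valid, and push invalid noise toward the valid region) can be illustrated with a toy sketch. This is not the authors' implementation. The score here is a hypothetical stand-in (mean squared distance to an arbitrary target vector) for the cross-attention response and self-attention conflict scores, which in practice would require running the diffusion model. The threshold, learning rate, step limit, and all function names are assumptions made for illustration.

```python
import random


def toy_score(noise, target):
    """Hypothetical stand-in for the attention-based scores.

    Lower is better. The real scores would be computed from attention maps.
    """
    return sum((n - t) ** 2 for n, t in zip(noise, target)) / len(noise)


def toy_score_grad(noise, target):
    """Analytic gradient of toy_score with respect to each noise entry."""
    return [2.0 * (n - t) / len(noise) for n, t in zip(noise, target)]


def optimize_initial_noise(noise, target, threshold=0.05, lr=0.5, max_steps=200):
    """Gradient-descend the noise until its score falls below the threshold.

    Returns (noise, steps_taken, is_valid). The input list is not modified.
    """
    noise = list(noise)
    for step in range(max_steps):
        if toy_score(noise, target) < threshold:
            return noise, step, True  # noise is already in the "valid" region
        grad = toy_score_grad(noise, target)
        noise = [n - lr * g for n, g in zip(noise, grad)]
    return noise, max_steps, toy_score(noise, target) < threshold


if __name__ == "__main__":
    rng = random.Random(0)
    start = [rng.gauss(0.0, 1.0) for _ in range(8)]  # sampled initial noise
    target = [0.5] * 8  # arbitrary toy target, assumed for illustration
    final, steps, valid = optimize_initial_noise(start, target)
    print("score before:", toy_score(start, target))
    print("score after: ", toy_score(final, target))
    print("steps:", steps, "valid:", valid)
```

In this sketch the valid/invalid split is simply a threshold on the score. How the paper defines the boundary and the optimization objective is not stated in the abstract, so those details are left out here.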

URL

https://arxiv.org/abs/2404.04650

PDF

https://arxiv.org/pdf/2404.04650.pdf

