Abstract
Recent advances in diffusion models, exemplified by Stable Diffusion, have demonstrated their ability to generate visually compelling images. However, aligning the generated image closely with the provided prompt remains a difficult challenge. This paper traces the root of these difficulties to invalid initial noise and proposes Initial Noise Optimization (InitNO), a paradigm that refines this noise. For a given text prompt, not every random noise sample is effective in synthesizing a semantically faithful image. We design a cross-attention response score and a self-attention conflict score to evaluate the initial noise, partitioning the initial latent space into valid and invalid regions. A noise optimization pipeline then guides the initial noise towards valid regions. Experiments show that our method generates images that closely follow the text prompts. Our code is available at this https URL.
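The evaluate-then-refine idea in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `toy_score` is a hypothetical stand-in for the cross-attention response and self-attention conflict scores (which, going by their names, would be computed from the diffusion model's attention maps rather than from a fixed `target`), and `threshold`, `lr`, and `max_steps` are made-up values. The sketch also ignores any constraint on keeping the refined noise close to its original Gaussian distribution.

```python
import random


def toy_score(noise, target):
    """Hypothetical score: lower means the noise is 'more valid'.

    A mean squared distance to a fixed target is used only so that the
    gradient can be written by hand; it is not the paper's score.
    """
    return sum((n - t) ** 2 for n, t in zip(noise, target)) / len(noise)


def toy_score_grad(noise, target):
    """Analytic gradient of toy_score with respect to the noise."""
    return [2.0 * (n - t) / len(noise) for n, t in zip(noise, target)]


def optimize_initial_noise(noise, target, threshold=0.05, lr=0.5, max_steps=200):
    """Refine the noise by gradient descent until its score passes the threshold.

    Returns (refined_noise, steps_taken, is_valid).
    """
    noise = list(noise)
    for step in range(max_steps):
        # Evaluate first: noise that is already "valid" is left untouched.
        if toy_score(noise, target) <= threshold:
            return noise, step, True
        grad = toy_score_grad(noise, target)
        noise = [n - lr * g for n, g in zip(noise, grad)]
    return noise, max_steps, toy_score(noise, target) <= threshold


rng = random.Random(0)  # fixed seed so the sketch is reproducible
target = [0.5, -0.25, 1.0, 0.0]  # illustrative values only
initial = [rng.gauss(0.0, 1.0) for _ in target]
refined, steps, valid = optimize_initial_noise(initial, target)
print(f"valid={valid} after {steps} steps, score={toy_score(refined, target):.4f}")
```

In an actual text-to-image setting, the score would presumably be differentiated through the denoising network with an autodiff framework, and the refined noise would then be passed to the sampler in place of the original random draw.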
URL
https://arxiv.org/abs/2404.04650