Abstract
Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature prevents artists from creating consistent images of the same character. Existing methods tackle this challenge and generate consistent content in various ways, but they either depend on external data or require expensive tuning of the diffusion model. To address this issue, we argue that lightweight yet intricate guidance is sufficient. To that end, we lead the way in formalizing the objective of consistent generation, derive a clustering-based score function, and propose a novel paradigm, OneActor. We design a cluster-conditioned model that incorporates posterior samples to guide the denoising trajectories toward the target cluster. To overcome the overfitting challenge shared by one-shot tuning pipelines, we devise auxiliary components that simultaneously augment the tuning and regulate the inference. This technique is later verified to significantly enhance the content diversity of generated images. Comprehensive experiments show that our method outperforms a variety of baselines, with satisfactory character consistency, superior prompt conformity, and high image quality, while being at least 4 times faster than tuning-based baselines. Furthermore, to the best of our knowledge, we are the first to show that the semantic space has the same interpolation property as the latent space does. This property can serve as another promising tool for fine generation control.
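The interpolation property mentioned above means that moving linearly between two points in the conditioning (semantic) space yields smooth intermediate generations, just as interpolating latent codes does. A minimal sketch of such a linear walk between two conditioning embeddings is shown below; the vectors and function name are illustrative stand-ins, not the paper's actual representations.

```python
import numpy as np

def interpolate_embeddings(c_a, c_b, alphas):
    """Linearly interpolate between two conditioning embeddings.

    Each alpha yields (1 - alpha) * c_a + alpha * c_b, tracing a
    straight path in the embedding (semantic) space. Feeding these
    intermediate embeddings to a diffusion model would, per the
    interpolation property, produce smoothly varying images.
    """
    c_a = np.asarray(c_a, dtype=float)
    c_b = np.asarray(c_b, dtype=float)
    return [(1.0 - a) * c_a + a * c_b for a in alphas]

# Toy 2-D embeddings standing in for two character representations.
c_a = np.array([1.0, 0.0])
c_b = np.array([0.0, 1.0])
path = interpolate_embeddings(c_a, c_b, [0.0, 0.5, 1.0])
```

In practice, each element of `path` would replace the usual text-encoder output as the model's conditioning signal, giving continuous control between the two endpoints.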
URL
https://arxiv.org/abs/2404.10267