Abstract
Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face a pivotal challenge: the region actually affected by the guidance is broader than the region the user intends to edit. Although attention-based methods have been developed to refine the editing guidance, they require complex modifications to the network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and the misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus introduces excessive low-frequency signals into the editing guidance. Leveraging this insight, we introduce a novel fine-tuning-free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves results comparable to state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.
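The frequency argument in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the paper's method or code; it only assumes an idealized signal power spectrum of the form amplitude / f^alpha (the power law, with a hypothetical exponent alpha = 2), white noise of flat power sigma^2, and a made-up list of decaying noise levels. The function name `recoverable_cutoff` and all numeric values are illustrative assumptions.

```python
def recoverable_cutoff(sigma: float, amplitude: float = 1.0, alpha: float = 2.0) -> float:
    """Highest frequency at which signal power still matches the noise power.

    Assumptions (illustrative, not from the paper):
      - signal power follows a power law, S(f) = amplitude / f**alpha
      - noise is white, with flat power sigma**2 at every frequency
    Solving amplitude / f**alpha = sigma**2 for f gives the cutoff.
    """
    return (amplitude / sigma**2) ** (1.0 / alpha)


# Hypothetical decaying noise levels, from an early (noisy) timestep to a late one.
sigmas = [1.0, 0.5, 0.25, 0.125]
cutoffs = [recoverable_cutoff(s) for s in sigmas]

for s, c in zip(sigmas, cutoffs):
    # Frequencies below the cutoff rise above the noise floor; those above it do not.
    print(f"sigma={s}: frequencies up to about {c:.1f} are above the noise")
```

Under these assumptions the cutoff grows from 1 to 8 as the noise level falls from 1.0 to 0.125, so at early timesteps only low frequencies stand out from the noise, which is consistent with the observation that early guidance is dominated by low-frequency content.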
URL
https://arxiv.org/abs/2404.11895