Diffusion-based generative models have achieved remarkable success in text-based image generation. However, because the generation process contains enormous randomness, it remains challenging to apply such models to real-world visual content editing, especially in videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor user-specific masks. To edit videos consistently, we propose several techniques based on pre-trained models. First, in contrast to the straightforward DDIM inversion technique, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are fused directly in the editing process rather than generated during denoising. To further minimize semantic leakage from the source video, we then fuse self-attention maps with a blending mask obtained from the cross-attention features of the source prompt. Furthermore, we reform the self-attention mechanism in the denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Though simple, our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model, and it also achieves better zero-shot shape-aware editing based on a text-to-video model. Extensive experiments demonstrate superior temporal consistency and editing capability compared with previous works.
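The attention-fusion idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper's actual code): attention maps are cached during inversion, a binary blend mask is obtained by thresholding the source prompt's cross-attention map, and edited self-attention is fused with the cached source attention inside that mask. All function names and the threshold value are assumptions for illustration only.

```python
import numpy as np

def cache_attention_during_inversion(num_steps, attn_shape, rng):
    """Stand-in for DDIM inversion: store one attention map per timestep.
    A real implementation would hook the UNet's attention layers."""
    return [rng.random(attn_shape) for _ in range(num_steps)]

def blend_mask_from_cross_attention(cross_attn, threshold=0.5):
    """Threshold the source cross-attention map (for the edited word)
    into a binary mask marking the region to be edited."""
    normalized = cross_attn / cross_attn.max()
    return (normalized > threshold).astype(np.float32)

def fuse_self_attention(source_attn, edited_attn, mask):
    """Keep cached source attention outside the edit region (preserving
    structure and motion) and the edited attention inside it."""
    return mask * edited_attn + (1.0 - mask) * source_attn

# Toy demonstration with random maps standing in for real attention.
rng = np.random.default_rng(0)
num_steps, attn_shape = 4, (8, 8)
cached_maps = cache_attention_during_inversion(num_steps, attn_shape, rng)
mask = blend_mask_from_cross_attention(rng.random(attn_shape))
edited_attn = rng.random(attn_shape)
fused = fuse_self_attention(cached_maps[0], edited_attn, mask)
```

The key design point the sketch mirrors is that the source maps come from inversion of the real video, not from a fresh denoising pass, so the fused result inherits the original structure and motion wherever the mask is zero.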