Paper Reading AI Learner

Pix2Video: Video Editing using Image Diffusion

2023-03-22 16:36:10
Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra

Abstract

Image diffusion models, trained on massive image collections, have emerged as the most versatile image generators in terms of quality and diversity. They support inverting real images and conditional (e.g., text-guided) generation, which makes them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth-conditioned) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to future frames via self-attention feature injection, adapting the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate its effectiveness through extensive experiments and comparisons against four prior and concurrent efforts (on arXiv), showing that realistic text-guided video edits are possible without compute-intensive preprocessing or video-specific finetuning.
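
The abstract sketches an implementable recipe: edit an anchor frame with a depth-conditioned diffusion model, then propagate the edit by letting later frames attend to the anchor frame's self-attention features. Below is a minimal, illustrative sketch of that idea using the Hugging Face `diffusers` library. `StableDiffusionDepth2ImgPipeline` and the attention-processor hook are real `diffusers` APIs, but `CrossFrameAttnProcessor`, its key/value-sharing policy, the prompt, and the frame paths are my own assumptions for illustration, and the paper's latent-update (consolidation) step is omitted entirely; this is not the authors' code.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.models.attention_processor import AttnProcessor


class CrossFrameAttnProcessor(AttnProcessor):
    """Records anchor-frame self-attention keys/values, then injects them
    when denoising later frames so the edit stays coherent across frames.

    Simplified: assumes 3D hidden states (batch, tokens, dim), as in Stable
    Diffusion's transformer blocks, and identical sampler settings for every
    frame so cached entries line up by call order.
    """

    def __init__(self):
        self.mode = "record"   # "record" on the anchor frame, then "inject"
        self.cache = []        # one (key, value) per self-attention call
        self.idx = 0           # read cursor while injecting

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, **kwargs):
        is_self_attn = encoder_hidden_states is None
        context = hidden_states if is_self_attn else encoder_hidden_states

        query = attn.to_q(hidden_states)
        key = attn.to_k(context)
        value = attn.to_v(context)

        if is_self_attn:
            if self.mode == "record":
                self.cache.append((key.detach(), value.detach()))
            elif self.mode == "inject" and self.cache:
                # Let the current frame attend to the anchor frame's
                # features in addition to its own.
                k_a, v_a = self.cache[self.idx % len(self.cache)]
                self.idx += 1
                key = torch.cat([k_a, key], dim=1)
                value = torch.cat([v_a, value], dim=1)

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        probs = attn.get_attention_scores(query, key)
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        out = attn.to_out[0](out)   # output linear projection
        return attn.to_out[1](out)  # dropout


pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

processor = CrossFrameAttnProcessor()
pipe.unet.set_attn_processor(processor)  # one shared processor for all layers

prompt = "a marble sculpture of a man walking"  # example edit prompt
# Hypothetical frame paths; substitute your own video frames.
frames = [Image.open(f"frames/{i:04d}.png") for i in range(24)]

# Step 1: depth-conditioned, text-guided edit of the anchor frame.
processor.mode = "record"
edited = [pipe(prompt=prompt, image=frames[0], strength=0.75).images[0]]

# Step 2: propagate the edit by injecting anchor features into later frames.
processor.mode = "inject"
for frame in frames[1:]:
    edited.append(pipe(prompt=prompt, image=frame, strength=0.75).images[0])
```

Caching every self-attention call's keys and values, as above, is memory-hungry; it is chosen here for simplicity. Practical implementations typically restrict injection to a subset of attention layers and denoising timesteps, and may also attend to the previous frame rather than the anchor alone.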

URL

https://arxiv.org/abs/2303.12688

PDF

https://arxiv.org/pdf/2303.12688.pdf

