Abstract
We present an end-to-end diffusion-based method for editing videos with human language instructions, namely $\textbf{InstructVid2Vid}$. Our approach enables the editing of input videos based on natural language instructions without any per-example fine-tuning or inversion. The proposed InstructVid2Vid model combines a pretrained image generation model, Stable Diffusion, with a conditional 3D U-Net architecture to generate a time-dependent sequence of video frames. To obtain the training data, we incorporate the knowledge and expertise of different models, including ChatGPT, BLIP, and Tune-a-Video, to synthesize video-instruction triplets, which is a more cost-efficient alternative to collecting data in real-world scenarios. To improve the consistency between adjacent frames of generated videos, we propose the Frame Difference Loss, which is incorporated during the training process. During inference, we extend classifier-free guidance to text-video input to guide the generated results, making them more related to both the input video and the instruction. Experiments demonstrate that InstructVid2Vid is able to generate high-quality, temporally coherent videos and perform diverse edits, including attribute editing, change of background, and style transfer. These results highlight the versatility and effectiveness of our proposed method. Code is released at $\href{this https URL}{InstructVid2Vid}$.
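The abstract does not give the formula for the Frame Difference Loss. A minimal sketch of one plausible formulation, assuming the loss penalizes the mismatch between adjacent-frame differences of the generated and target videos (the function name, array shapes, and mean-squared-error form are assumptions, not the paper's definition):

```python
import numpy as np

def frame_difference_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Hypothetical sketch of a frame-difference loss for temporal consistency.

    pred, target: videos of shape (T, H, W, C), where T is the number of frames.
    Compares the differences between adjacent frames in the predicted video
    against those in the target video, so that only mismatched *motion*
    (not mismatched appearance) is penalized.
    """
    pred_diff = pred[1:] - pred[:-1]        # adjacent-frame differences, shape (T-1, H, W, C)
    target_diff = target[1:] - target[:-1]
    return float(np.mean((pred_diff - target_diff) ** 2))
```

Note that under this formulation two videos with identical motion but different appearance incur zero loss, which is why such a term would be used alongside, not instead of, the standard diffusion denoising objective.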
URL
https://arxiv.org/abs/2305.12328