Abstract
Video-to-video editing involves editing a source video under additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with both the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework that simplifies video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g., InstructPix2Pix or InstantID) to modify the first frame, and (2) utilizing an existing image-to-video generation model (e.g., I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tool to support an extensive array of video editing tasks. Beyond traditional prompt-based editing, AnyV2V can also support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video model to perform DDIM inversion and intermediate feature injection, maintaining appearance and motion consistency with the source video. On prompt-based editing, we show that AnyV2V outperforms the previous best approach by 35% on prompt alignment and 25% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate fast-evolving image editing methods; this compatibility helps AnyV2V increase its versatility to cater to diverse user demands.
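The two-stage flow the abstract describes can be sketched as a small driver that composes three pluggable components. This is a minimal illustrative sketch, not the authors' implementation: the function names (`anyv2v_edit`, `image_editor`, `ddim_invert`, `generate_video`) and the toy frame representation are hypothetical stand-ins for an image editing model, a DDIM inversion routine, and an image-to-video generator with feature injection.

```python
from typing import Callable, List

# Toy stand-in for an image: a 2D grid of grayscale values.
Frame = List[List[float]]


def anyv2v_edit(
    source_frames: List[Frame],
    image_editor: Callable[[Frame], Frame],
    ddim_invert: Callable[[List[Frame]], List[Frame]],
    generate_video: Callable[[Frame, List[Frame]], List[Frame]],
) -> List[Frame]:
    """Hypothetical sketch of the AnyV2V two-stage pipeline.

    Stage 1: edit only the first frame with any off-the-shelf image editor.
    Stage 2: DDIM-invert the source video to latents, then regenerate a video
    conditioned on the edited first frame, injecting intermediate features so
    that appearance and motion stay consistent with the source.
    """
    edited_first = image_editor(source_frames[0])          # stage 1
    inverted_latents = ddim_invert(source_frames)          # stage 2a
    return generate_video(edited_first, inverted_latents)  # stage 2b


# Usage with placeholder components: the "editor" brightens the first frame,
# inversion is the identity, and "generation" propagates the edit.
frames = [[[0.1]], [[0.2]], [[0.3]]]
edited = anyv2v_edit(
    frames,
    image_editor=lambda f: [[p + 1.0 for p in row] for row in f],
    ddim_invert=lambda fs: fs,
    generate_video=lambda first, latents: [first] + latents[1:],
)
```

Because each component is passed in as a callable, any image editor or image-to-video model can be swapped in without retraining, which is the key point of the framework's modularity.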
URL
https://arxiv.org/abs/2403.14468