Abstract
With the prosperity of video diffusion models, downstream applications such as video editing have advanced significantly without incurring much computational cost. One particular challenge in this task lies in the motion transfer process from the source video to the edited one: it must account for the shape deformation between the two while maintaining temporal consistency in the generated video sequence. However, existing methods fail to model complicated motion patterns for video editing and are fundamentally limited to object replacement, largely neglecting tasks with non-rigid object motions such as multi-object and portrait editing. In this paper, we observe that optical flow offers a promising alternative for complex motion modeling, and present FlowV2V, which re-investigates video editing as a task of flow-driven Image-to-Video (I2V) generation. Specifically, FlowV2V decomposes the entire pipeline into first-frame editing and conditional I2V generation, and simulates a pseudo flow sequence that aligns with the deformed shape, thereby ensuring consistency during editing. Experimental results on DAVIS-EDIT, with improvements of 13.67% and 50.66% in DOVER and warping error respectively, demonstrate the superior temporal consistency and sample quality of FlowV2V compared to existing state-of-the-art methods. Furthermore, we conduct comprehensive ablation studies to analyze the internal functionalities of the first-frame paradigm and flow alignment in the proposed method.
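The two-stage pipeline described above (first-frame editing followed by flow-conditioned I2V generation) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`estimate_flow`, `align_flow`, `flow_driven_edit`) are assumptions, the constant-flow estimator is a stand-in for a learned model such as RAFT, and `i2v_model` stands in for a diffusion-based I2V generator.

```python
import numpy as np

def estimate_flow(frames):
    """Placeholder optical-flow estimator (illustrative only).

    A real pipeline would use a learned flow network; here each flow
    field is just the mean frame difference, replicated spatially,
    so the sketch stays self-contained.
    """
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        d = float((nxt - prev).mean())
        flows.append(np.full(prev.shape + (2,), d, dtype=np.float32))
    return flows

def align_flow(flow, edit_mask):
    """Pseudo-flow alignment (hypothetical): keep motion only inside
    the edited region so the flow follows the deformed shape."""
    return flow * edit_mask[..., None]

def flow_driven_edit(frames, edit_first_frame, edit_mask, i2v_model):
    """Two-stage pipeline: (1) edit the first frame, (2) condition an
    I2V model on the edited frame plus the aligned pseudo flow
    sequence to propagate the edit through time."""
    first = edit_first_frame(frames[0])
    flows = [align_flow(f, edit_mask) for f in estimate_flow(frames)]
    return i2v_model(first, flows)
```

Decoupling the appearance change (first-frame edit) from the motion (flow sequence) is what lets the generator preserve temporal consistency: only the flow, not the edited content, has to be re-estimated per frame.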
URL
https://arxiv.org/abs/2506.07713