Paper Reading AI Learner

Consistent Video Editing as Flow-Driven Image-to-Video Generation

2025-06-09 12:57:30
Ge Wang, Songlin Fan, Hangxu Liu, Quanjian Song, Hewei Wang, Jinfeng Xu

Abstract

With the prosperity of video diffusion models, downstream applications such as video editing have advanced significantly without incurring much computational cost. One particular challenge in this task lies in transferring motion from the source video to the edited one, which requires accounting for the shape deformation between them while maintaining temporal consistency in the generated video sequence. However, existing methods fail to model complicated motion patterns for video editing and are fundamentally limited to object replacement, while tasks involving non-rigid object motion, such as multi-object and portrait editing, are largely neglected. In this paper, we observe that optical flow offers a promising alternative for complex motion modeling, and present FlowV2V, which re-investigates video editing as a task of flow-driven Image-to-Video (I2V) generation. Specifically, FlowV2V decomposes the entire pipeline into first-frame editing and conditional I2V generation, and simulates a pseudo optical-flow sequence that aligns with the deformed shape, thereby ensuring consistency during editing. Experimental results on DAVIS-EDIT, with improvements of 13.67% in DOVER score and 50.66% in warping error, demonstrate the superior temporal consistency and sample quality of FlowV2V compared to existing state-of-the-art methods. Furthermore, we conduct comprehensive ablation studies to analyze the internal functionality of the first-frame paradigm and flow alignment in the proposed method.
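The abstract describes a two-stage decomposition: edit the first frame, then generate the remaining frames via flow-conditioned I2V. The following Python sketch illustrates that control flow only; it is not the paper's implementation, and every callable (edit_first_frame, estimate_flow, align_flow_to_shape, flow_conditioned_i2v) is a hypothetical placeholder supplied by the caller.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# NOT the paper's code; all callables are hypothetical placeholders.

def flowv2v_sketch(source_frames, edit_prompt,
                   edit_first_frame,       # any off-the-shelf image editor
                   estimate_flow,          # any optical-flow estimator
                   align_flow_to_shape,    # pseudo-flow simulation step
                   flow_conditioned_i2v):  # flow-conditioned I2V generator
    """Edit a video by (1) editing frame 0, (2) generating the rest via I2V.

    source_frames: sequence of H x W x 3 frames of the source video.
    edit_prompt:   text describing the desired edit.
    """
    # Stage 1: edit only the first frame with an image editor.
    edited_first = edit_first_frame(source_frames[0], edit_prompt)

    # Estimate optical flow between consecutive source frames.
    source_flows = [estimate_flow(prev, nxt)
                    for prev, nxt in zip(source_frames[:-1], source_frames[1:])]

    # Simulate a pseudo-flow sequence aligned with the edited (deformed)
    # shape, so the motion field matches the new object geometry.
    pseudo_flows = [align_flow_to_shape(f, source_frames[0], edited_first)
                    for f in source_flows]

    # Stage 2: condition I2V generation on the edited first frame and the
    # aligned pseudo flows to keep the output temporally consistent.
    return flow_conditioned_i2v(edited_first, pseudo_flows)
```

The design point of this decomposition is that single-image editing is already mature, so the harder video-consistency problem reduces to propagating one edited frame under a motion field aligned with the new shape.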
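For context on the reported metrics: DOVER is a no-reference video quality assessor, while warping error measures temporal consistency by warping one frame onto the next with the estimated optical flow and computing the photometric difference. Below is a minimal sketch of a common warping-error computation (forward flow assumed; the occlusion masking used in most evaluation protocols is omitted for brevity).

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warping_error(frame_t, frame_t1, flow_t_to_t1):
    """Mean squared photometric error after flow-based warping.

    frame_t, frame_t1: H x W x C float arrays in [0, 1].
    flow_t_to_t1:      H x W x 2 forward flow; [..., 0] = dx, [..., 1] = dy.
    Occlusion masking, applied in most benchmarks, is omitted here.
    """
    h, w = frame_t.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Backward-warp frame_t1 into frame_t's coordinates: each pixel (x, y)
    # samples frame_t1 at (x + dx, y + dy) with bilinear interpolation.
    sample_y = ys + flow_t_to_t1[..., 1]
    sample_x = xs + flow_t_to_t1[..., 0]
    warped = np.stack(
        [map_coordinates(frame_t1[..., c], [sample_y, sample_x], order=1)
         for c in range(frame_t1.shape[-1])],
        axis=-1)
    return float(np.mean((warped - frame_t) ** 2))
```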


URL

https://arxiv.org/abs/2506.07713

PDF

https://arxiv.org/pdf/2506.07713.pdf

