Paper Reading AI Learner

Visual Prompting for One-shot Controllable Video Editing without Inversion

2025-04-19 16:00:47
Zhengbo Zhang, Yuxi Zhou, Duo Peng, Joo-Hwee Lim, Zhigang Tu, De Wen Soh, Lin Geng Foo

Abstract

One-shot controllable video editing (OCVE) is an important yet challenging task, aiming to propagate user edits that are made -- using any image editing tool -- on the first frame of a video to all subsequent frames, while ensuring content consistency between edited frames and source frames. To achieve this, prior methods employ DDIM inversion to transform source frames into latent noise, which is then fed into a pre-trained diffusion model, conditioned on the user-edited first frame, to generate the edited video. However, the DDIM inversion process accumulates errors, which prevent the latent noise from accurately reconstructing the source frames, ultimately compromising content consistency in the generated edited frames. To overcome this issue, our method eliminates the need for DDIM inversion by performing OCVE from a novel perspective based on visual prompting. Furthermore, inspired by consistency models, which can perform multi-step consistency sampling to generate a sequence of content-consistent images, we propose content consistency sampling (CCS) to ensure content consistency between the generated edited frames and the source frames. Moreover, we introduce temporal-content consistency sampling (TCS), based on Stein Variational Gradient Descent, to ensure temporal consistency across the edited frames. Extensive experiments validate the effectiveness of our approach.
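The abstract's TCS builds on Stein Variational Gradient Descent (SVGD), which iteratively transports a set of particles toward a target density using a kernelized update that balances an attraction term (the target's score, weighted by the kernel) against a repulsion term (the kernel gradient) that keeps particles diverse. Below is a minimal, generic SVGD sketch for 1-D particles with an RBF kernel; it is an illustration of the standard algorithm, not the paper's TCS procedure, and the function names, step size, and bandwidth are illustrative choices.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """RBF kernel matrix and its gradient w.r.t. the first argument.

    x: 1-D array of particle positions. Returns k[j, i] = exp(-(x_j - x_i)^2 / h)
    and grad_k[j, i] = d k(x_j, x_i) / d x_j.
    """
    diff = x[:, None] - x[None, :]          # diff[j, i] = x_j - x_i
    k = np.exp(-diff**2 / h)
    grad_k = -2.0 * diff / h * k
    return k, grad_k

def svgd_step(x, grad_log_p, step=0.1, h=1.0):
    """One SVGD update: x_i += step * (1/n) * sum_j [k(x_j, x_i) * score(x_j)
    + d/dx_j k(x_j, x_i)].

    The first term pulls particles toward high-density regions of the target;
    the second term repels particles from each other, preserving diversity.
    """
    k, grad_k = rbf_kernel(x, h)
    n = len(x)
    phi = (k @ grad_log_p(x) + grad_k.sum(axis=0)) / n
    return x + step * phi

if __name__ == "__main__":
    # Toy target: standard normal, whose score is grad log p(x) = -x.
    rng = np.random.default_rng(0)
    particles = rng.uniform(2.0, 4.0, size=20)  # start far from the target mode
    for _ in range(500):
        particles = svgd_step(particles, lambda x: -x)
    print(particles.mean())  # particles concentrate around the target mean 0
```

In the paper's setting, the "particles" would be latents of the edited frames and the update would be shaped to trade off fidelity to the target distribution against cross-frame agreement; this toy example only shows the underlying SVGD mechanics.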

URL

https://arxiv.org/abs/2504.14335

PDF

https://arxiv.org/pdf/2504.14335.pdf
