
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks

2024-03-21 15:15:00
Max Ku, Cong Wei, Weiming Ren, Huan Yang, Wenhu Chen

Abstract

Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g., InstructPix2Pix, InstantID) to modify the first frame, and (2) utilizing an existing image-to-video generation model (e.g., I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tool to support an extensive array of video editing tasks. Beyond traditional prompt-based editing, AnyV2V can also support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video model to perform DDIM inversion and intermediate feature injection, maintaining appearance and motion consistency with the source video. On prompt-based editing, we show that AnyV2V outperforms the previous best approach by 35% on prompt alignment and 25% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate fast-evolving image editing methods; such compatibility can help AnyV2V increase its versatility to cater to diverse user demands.
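
The two-stage recipe described in the abstract can be sketched in code. The following is a minimal sketch, assuming the Hugging Face diffusers library for both stages: InstructPix2Pix edits the first frame, and I2VGen-XL regenerates a video conditioned on it. The DDIM-inversion and feature-injection steps that AnyV2V adds on top of the image-to-video model are only indicated by hypothetical helpers (ddim_invert, sample_with_injection); the file paths, prompts, and parameter values are likewise illustrative and not taken from the paper.

import torch
from diffusers import StableDiffusionInstructPix2PixPipeline, I2VGenXLPipeline
from diffusers.utils import load_image, export_to_video

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Stage 1: modify the first frame with an off-the-shelf image editor.
# InstructPix2Pix is shown; any first-frame editor (InstantID, etc.) could be
# swapped in, which is the point of the plug-and-play design.
editor = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=dtype
).to(device)
first_frame = load_image("frames/frame_0000.png")           # hypothetical path
edited_frame = editor(
    "make it look like a watercolor painting",               # example instruction
    image=first_frame,
    num_inference_steps=50,
    image_guidance_scale=1.5,
).images[0]

# Stage 2: feed the edited first frame to an off-the-shelf image-to-video model
# (I2VGen-XL here). AnyV2V additionally runs DDIM inversion on the source video
# and injects intermediate features during sampling so the output keeps the
# source appearance and motion; those steps are custom to the paper and are only
# indicated by the hypothetical helpers in the commented lines below.
i2v = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=dtype
).to(device)

# source_latents = ddim_invert(i2v, source_frames)                        # hypothetical helper
# video_frames = sample_with_injection(i2v, edited_frame, source_latents) # hypothetical helper

# A plain image-to-video call (without the injection step) illustrates the interface:
video_frames = i2v(
    prompt="a watercolor-style scene",
    image=edited_frame,
    num_inference_steps=50,
).frames[0]
export_to_video(video_frames, "edited_video.mp4", fps=8)

In the actual framework, the plain image-to-video call at the end would be replaced by the inversion-and-injection sampling loop, which is what ties the edited first frame back to the source video's motion.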

URL

https://arxiv.org/abs/2403.14468

PDF

https://arxiv.org/pdf/2403.14468.pdf
