Paper Reading AI Learner

Pix2Video: Video Editing using Image Diffusion

2023-03-22 16:36:10
Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra

Abstract

Image diffusion models, trained on massive image collections, have emerged as the most versatile image generators in terms of quality and diversity. They support inverting real images and conditional (e.g., text-guided) generation, which makes them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth-conditioned) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to future frames via self-attention feature injection, adapting the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate its effectiveness through extensive experiments and comparisons against four prior and concurrent efforts (on arXiv), showing that realistic text-guided video edits are possible without compute-intensive preprocessing or video-specific finetuning.
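
The abstract sketches an implementable recipe: edit an anchor frame with a depth-conditioned diffusion model, then propagate the edit by letting later frames attend to the anchor frame's self-attention features. Below is a minimal, illustrative sketch of that idea using the Hugging Face `diffusers` library. `StableDiffusionDepth2ImgPipeline` and the attention-processor hook are real `diffusers` APIs, but `CrossFrameAttnProcessor`, its key/value-sharing policy, the prompt, and the frame paths are my own assumptions for illustration, and the paper's latent-update (consolidation) step is omitted entirely; this is not the authors' code.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.models.attention_processor import AttnProcessor


class CrossFrameAttnProcessor(AttnProcessor):
    """Records anchor-frame self-attention keys/values, then injects them
    when denoising later frames so the edit stays coherent across frames.

    Simplified: assumes 3D hidden states (batch, tokens, dim), as in Stable
    Diffusion's transformer blocks, and identical sampler settings for every
    frame so cached entries line up by call order.
    """

    def __init__(self):
        self.mode = "record"   # "record" on the anchor frame, then "inject"
        self.cache = []        # one (key, value) per self-attention call
        self.idx = 0           # read cursor while injecting

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, **kwargs):
        is_self_attn = encoder_hidden_states is None
        context = hidden_states if is_self_attn else encoder_hidden_states

        query = attn.to_q(hidden_states)
        key = attn.to_k(context)
        value = attn.to_v(context)

        if is_self_attn:
            if self.mode == "record":
                self.cache.append((key.detach(), value.detach()))
            elif self.mode == "inject" and self.cache:
                # Let the current frame attend to the anchor frame's
                # features in addition to its own.
                k_a, v_a = self.cache[self.idx % len(self.cache)]
                self.idx += 1
                key = torch.cat([k_a, key], dim=1)
                value = torch.cat([v_a, value], dim=1)

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        probs = attn.get_attention_scores(query, key)
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        out = attn.to_out[0](out)   # output linear projection
        return attn.to_out[1](out)  # dropout


pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

processor = CrossFrameAttnProcessor()
pipe.unet.set_attn_processor(processor)  # one shared processor for all layers

prompt = "a marble sculpture of a man walking"  # example edit prompt
# Hypothetical frame paths; substitute your own video frames.
frames = [Image.open(f"frames/{i:04d}.png") for i in range(24)]

# Step 1: depth-conditioned, text-guided edit of the anchor frame.
processor.mode = "record"
edited = [pipe(prompt=prompt, image=frames[0], strength=0.75).images[0]]

# Step 2: propagate the edit by injecting anchor features into later frames.
processor.mode = "inject"
for frame in frames[1:]:
    edited.append(pipe(prompt=prompt, image=frame, strength=0.75).images[0])
```

Caching every self-attention call's keys and values, as above, is memory-hungry; it is chosen here for simplicity. Practical implementations typically restrict injection to a subset of attention layers and denoising timesteps, and may also attend to the previous frame rather than the anchor alone.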

URL

https://arxiv.org/abs/2303.12688

PDF

https://arxiv.org/pdf/2303.12688.pdf

