InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

2023-05-21 03:28:13
Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, Yueting Zhuang

Abstract

We present an end-to-end diffusion-based method for editing videos with natural language instructions, namely $\textbf{InstructVid2Vid}$. Our approach edits input videos based on natural language instructions without any per-example fine-tuning or inversion. The proposed InstructVid2Vid model combines a pretrained image generation model, Stable Diffusion, with a conditional 3D U-Net architecture to generate a time-dependent sequence of video frames. To obtain training data, we incorporate the knowledge and expertise of different models, including ChatGPT, BLIP, and Tune-a-Video, to synthesize video-instruction triplets, a more cost-efficient alternative to collecting data in real-world scenarios. To improve consistency between adjacent frames of the generated videos, we propose the Frame Difference Loss, which is incorporated during training. During inference, we extend classifier-free guidance to text-video inputs to steer the generated results, making them better aligned with both the input video and the instruction. Experiments demonstrate that InstructVid2Vid generates high-quality, temporally coherent videos and performs diverse edits, including attribute editing, background change, and style transfer. These results highlight the versatility and effectiveness of our proposed method. Code is released at $\href{this https URL}{InstructVid2Vid}$.
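The abstract does not give the exact form of the Frame Difference Loss; a plausible reading is a penalty on the mismatch between the temporal differences of adjacent frames in the predicted and target videos, which encourages the generated video to reproduce the target's frame-to-frame motion. The PyTorch sketch below illustrates that reading only; the `(batch, time, channels, height, width)` tensor layout, the function name, and the `lambda_fd` weighting term are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def frame_difference_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Penalize mismatched temporal dynamics between two videos.

    pred, target: (batch, time, channels, height, width) -- layout assumed.
    """
    # Differences between each pair of adjacent frames along the time axis.
    pred_diff = pred[:, 1:] - pred[:, :-1]
    target_diff = target[:, 1:] - target[:, :-1]
    # Encourage the generated video to match the target's frame-to-frame changes.
    return F.mse_loss(pred_diff, target_diff)

# Hypothetical usage: added to the usual diffusion denoising objective,
# weighted by an assumed hyperparameter lambda_fd.
# total_loss = denoising_loss + lambda_fd * frame_difference_loss(pred, target)
```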
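The abstract also states that classifier-free guidance is extended to the text-video input. A common two-condition formulation (used by InstructPix2Pix for instruction-based image editing) composes three denoiser passes with separate guidance scales for the input video and the instruction; the sketch below assumes InstructVid2Vid does something analogous. `unet`, the `video`/`text` keyword arguments, the null embeddings, and the default scale values are all hypothetical placeholders, not the paper's API.

```python
import torch

@torch.no_grad()
def guided_eps(unet, z_t, t, video_cond, text_cond, null_video, null_text,
               s_vid: float = 1.5, s_txt: float = 7.5) -> torch.Tensor:
    """Two-condition classifier-free guidance over video and instruction.

    `unet` is any noise-prediction network accepting video and text conditioning.
    """
    eps_uncond = unet(z_t, t, video=null_video, text=null_text)  # fully unconditional
    eps_vid = unet(z_t, t, video=video_cond, text=null_text)     # video condition only
    eps_full = unet(z_t, t, video=video_cond, text=text_cond)    # video + instruction
    # Push the prediction toward the input video, then further toward the edit.
    return (eps_uncond
            + s_vid * (eps_vid - eps_uncond)
            + s_txt * (eps_full - eps_vid))
```

Under this decomposition, `s_vid` would control fidelity to the input video while `s_txt` controls how strongly the instruction is applied, both tunable per sample at inference time.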

URL

https://arxiv.org/abs/2305.12328

PDF

https://arxiv.org/pdf/2305.12328.pdf

