Paper Reading AI Learner

MotionMaster: Training-free Camera Motion Transfer For Video Generation

2024-04-24 10:28:54
Teng Hu, Jiangning Zhang, Ran Yi, Yating Wang, Hongrui Huang, Jieyu Weng, Yabiao Wang, Lizhuang Ma

Abstract

The emergence of diffusion models has greatly propelled progress in image and video generation. Recently, efforts have been made in controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module and demand substantial computational resources due to the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model that disentangles camera motions from object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving-object regions from the background motion by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the features shared across the temporal attention maps of multiple videos. Finally, we propose a motion combination method that combines different types of camera motions, enabling more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera and object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.
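The one-shot disentanglement step can be pictured as masked in-painting of the motion field: camera motion is observed in the background but unknown under the moving objects, so it is recovered there by a Poisson (harmonic) fill with the background motion as boundary condition. The sketch below is a minimal illustration of this idea, not the authors' implementation; the function name `fill_camera_motion`, the single-channel flow input, and the pixel-wise object mask are assumptions made for the example.

```python
# Minimal sketch (assumed interface, not the paper's code): fill the camera motion
# inside a moving-object mask by solving a discrete Laplace/Poisson problem whose
# Dirichlet boundary values come from the observed background motion.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def fill_camera_motion(motion, mask):
    """motion: (H, W) float array, one flow channel; mask: (H, W) bool, True = moving object.
    Returns a copy of `motion` with the masked region harmonically filled."""
    H, W = motion.shape
    ys, xs = np.where(mask)
    idx = -np.ones((H, W), dtype=int)
    idx[ys, xs] = np.arange(len(ys))              # unknown pixels get equation indices
    A = lil_matrix((len(ys), len(ys)))
    b = np.zeros(len(ys))
    for k, (y, x) in enumerate(zip(ys, xs)):
        A[k, k] = 4.0
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W:
                if mask[ny, nx]:
                    A[k, idx[ny, nx]] = -1.0      # neighbour is also unknown
                else:
                    b[k] += motion[ny, nx]        # known background value goes to the RHS
            else:
                A[k, k] -= 1.0                    # mirror (Neumann-like) handling at image border
    filled = motion.astype(np.float64)
    if len(ys) > 0:
        filled[ys, xs] = spsolve(A.tocsr(), b)
    return filled
```

In this simplified picture, applying the fill to each flow channel of each frame yields an approximation of the pure camera motion; subtracting it from the observed motion would then leave the object motion.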


URL

https://arxiv.org/abs/2404.15789

PDF

https://arxiv.org/pdf/2404.15789.pdf

