Abstract
The emergence of diffusion models has greatly propelled progress in image and video generation. Recently, efforts have been made toward controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module and require substantial computational resources due to the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model that disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method that extracts camera motion from a single source video: it separates the moving objects from the background and estimates the camera motion in the moving-object regions from the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method that extracts the common camera motion from multiple videos with similar camera motions, employing a window-based clustering technique to extract the features shared across the temporal attention maps of the videos. Finally, we propose a motion combination method that composes different types of camera motions, enabling more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach effectively decouples camera and object motion and applies the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.
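The one-shot disentanglement step estimates camera motion inside the moving-object region from the surrounding background motion by solving a Poisson equation. The paper does not give implementation details, but the idea can be sketched as harmonic inpainting of a dense motion field: pixels covered by moving objects are treated as unknowns, and Laplace's equation (the homogeneous Poisson case) is solved with the background motion as the boundary condition, here with a simple Jacobi iteration. The function name and the choice of iterative solver are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def inpaint_motion(flow, mask, iters=2000):
    """Fill the masked (moving-object) region of one component of a 2D
    motion field by solving Laplace's equation, so the filled-in values
    blend smoothly with the surrounding background motion.
    flow: (H, W) array of one motion component (e.g. horizontal flow).
    mask: (H, W) boolean array, True where the motion is unknown.
    Note: np.roll wraps around the grid edges, so this sketch assumes
    the mask does not touch the image border."""
    f = flow.astype(float).copy()
    f[mask] = flow[~mask].mean()  # crude initial guess inside the hole
    for _ in range(iters):
        # Jacobi update: each unknown pixel moves toward the mean of its
        # four neighbours; known (background) pixels are never modified.
        up    = np.roll(f, -1, axis=0)
        down  = np.roll(f,  1, axis=0)
        left  = np.roll(f, -1, axis=1)
        right = np.roll(f,  1, axis=1)
        avg = (up + down + left + right) / 4.0
        f[mask] = avg[mask]
    return f
```

In practice one would run this once per flow component (u and v) and replace the Jacobi loop with a sparse direct solver for speed; the fixed-point it converges to is the same.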
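The few-shot variant extracts the motion shared by several videos via window-based clustering of their temporal attention maps. As a hedged sketch (the feature layout, window summary, and clustering choice below are assumptions, not the paper's exact procedure): for each spatial window, the per-video features are clustered, and the centroid of the dominant cluster is kept, on the reasoning that object motions differ across videos while the camera motion is common.

```python
import numpy as np

def common_window_features(attn_maps, win=4, iters=20):
    """attn_maps: (N, H, W) per-video temporal-attention features from N
    videos assumed to share a similar camera motion (hypothetical layout).
    For each win x win spatial window, cluster the N per-video window
    means with a tiny 2-means and keep the centroid of the larger
    cluster: outlying videos whose object motion dominates that window
    fall into the smaller cluster and are discarded.
    Returns an (H // win, W // win) map of common features."""
    N, H, W = attn_maps.shape
    out = np.zeros((H // win, W // win))
    for i in range(H // win):
        for j in range(W // win):
            # one scalar summary per video for this window
            x = attn_maps[:, i*win:(i+1)*win, j*win:(j+1)*win].mean(axis=(1, 2))
            c = np.array([x.min(), x.max()])  # deterministic 2-means init
            for _ in range(iters):
                lbl = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
                for m in range(2):
                    if np.any(lbl == m):
                        c[m] = x[lbl == m].mean()
            big = np.argmax(np.bincount(lbl, minlength=2))
            out[i, j] = c[big]
    return out
```

With more videos one would cluster full feature vectors per window (e.g. k-means over the attention rows) rather than scalar summaries, but the majority-cluster selection logic is the same.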
URL
https://arxiv.org/abs/2404.15789