Abstract
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality, temporally consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN), which disentangle video concepts and thereby enable efficient, independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of the dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle the latent dynamics. To enforce the minimal change principle and the sufficient change property, we minimize the dimensionality of the dynamic latent variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
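To make the abstract's key design concrete, below is a minimal PyTorch sketch of the general idea it describes: a per-frame latent code split into a static part (identity, shared across frames) and a deliberately low-dimensional dynamic part (motion) whose components evolve under temporally conditionally independent transitions. All names (TemporalTransition, dyn_dim, rollout, etc.) and architectural details are illustrative assumptions for exposition, not the authors' implementation; see the paper at the URL below for the actual method.

```python
# Illustrative sketch only: static/dynamic latent split with
# component-wise, conditionally independent temporal transitions.
import torch
import torch.nn as nn

class TemporalTransition(nn.Module):
    """Evolves each dynamic latent component independently given the
    previous frame's value, enforcing temporal conditional independence."""
    def __init__(self, dyn_dim: int, hidden: int = 32):
        super().__init__()
        # One tiny MLP per dynamic component: z_d[t, i] depends only on
        # z_d[t-1, i] and fresh noise, never on the other components.
        self.transitions = nn.ModuleList([
            nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(dyn_dim)
        ])

    def forward(self, z_prev: torch.Tensor) -> torch.Tensor:
        # z_prev: (batch, dyn_dim) dynamic latents at frame t-1.
        noise = torch.randn_like(z_prev)
        cols = [
            f(torch.stack([z_prev[:, i], noise[:, i]], dim=-1))
            for i, f in enumerate(self.transitions)
        ]
        return torch.cat(cols, dim=-1)  # (batch, dyn_dim) latents at frame t

def rollout(static_z, transition, dyn0, num_frames):
    """Builds per-frame latents for a generator: the static code is shared
    by all frames, while the small dynamic code evolves over time."""
    z_d, frames = dyn0, []
    for _ in range(num_frames):
        z_d = transition(z_d)
        frames.append(torch.cat([static_z, z_d], dim=-1))
    return torch.stack(frames, dim=1)  # (batch, T, static_dim + dyn_dim)

# Usage: a 128-dim identity code plus an 8-dim motion code (kept small,
# in the spirit of the minimal change principle), unrolled for 16 frames.
batch, static_dim, dyn_dim = 4, 128, 8
trans = TemporalTransition(dyn_dim)
latents = rollout(torch.randn(batch, static_dim), trans,
                  torch.randn(batch, dyn_dim), num_frames=16)
print(latents.shape)  # torch.Size([4, 16, 136])
```

Under this factorization, editing the static code alone changes identity while leaving motion intact, and vice versa, which is the kind of independent concept-level control the abstract claims.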
URL
https://arxiv.org/abs/2502.02690