Abstract
Predicting future video frames is challenging because it is difficult to learn the uncertainty of the underlying factors influencing their content. In this paper, we propose a novel video prediction model with infinite-dimensional latent variables over the spatio-temporal domain. Specifically, we first decompose the video into motion and content information, then use a neural stochastic differential equation to predict the temporal motion information, and finally an image diffusion model autoregressively generates each video frame conditioned on the predicted motion feature and the previous frame. The greater expressiveness and stronger stochasticity-learning capability of our model lead to state-of-the-art video prediction performance. Moreover, our model achieves temporally continuous prediction, i.e., it can predict future video frames at an arbitrarily high frame rate in an unsupervised way. Our code is available at \url{this https URL}.
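The pipeline described above can be sketched schematically. The snippet below is a minimal illustration, not the paper's actual model: `drift`, `diffusion`, and `denoise` stand in for learned networks (here replaced by fixed toy functions), and the motion latent is evolved with an Euler–Maruyama discretization of an SDE, which is the standard way neural SDEs are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

def drift(z):
    # Stand-in for a learned drift network; a fixed contraction here.
    return -0.1 * z

def diffusion(z):
    # Stand-in for a learned diffusion network; constant noise scale here.
    return 0.2 * np.ones_like(z)

def sde_step(z, dt):
    # One Euler-Maruyama step of the SDE over the motion latent:
    # z_{t+dt} = z_t + f(z_t) dt + g(z_t) sqrt(dt) * eps
    noise = rng.standard_normal(z.shape) * np.sqrt(dt)
    return z + drift(z) * dt + diffusion(z) * noise

def generate_frame(frame_prev, motion):
    # Stand-in for the conditional image diffusion model: it would
    # denoise a sample conditioned on (motion feature, previous frame).
    # Here it is a trivial deterministic blend for illustration only.
    return 0.9 * frame_prev + 0.1 * motion.mean()

def predict(frame0, z0, n_steps, dt=0.1):
    frames, z, frame = [], z0, frame0
    for _ in range(n_steps):
        z = sde_step(z, dt)                 # continuous-time motion prediction
        frame = generate_frame(frame, z)    # autoregressive frame generation
        frames.append(frame)
    return frames

frames = predict(np.zeros((8, 8)), np.zeros(4), n_steps=5)
```

Because the motion latent evolves in continuous time, choosing a smaller `dt` yields intermediate motion states between observed frames, which is what enables prediction at an arbitrarily high frame rate.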
URL
https://arxiv.org/abs/2312.06486