Abstract
Stochastic video prediction is usually framed as an extrapolation problem where the goal is to sample a sequence of consecutive future image frames conditioned on a sequence of observed past frames. For the most part, algorithms for this task generate future video frames sequentially in an autoregressive fashion, which is slow and requires the input and output to be consecutive. We introduce a model that overcomes these drawbacks -- it learns to generate a global latent representation from an arbitrary set of frames within a video. This representation can then be used to simultaneously and efficiently sample any number of temporally consistent frames at arbitrary time-points in the video. We apply our model to synthetic video prediction tasks and achieve results that are comparable to state-of-the-art video prediction models. In addition, we demonstrate the flexibility of our model by applying it to 3D scene reconstruction where we condition on location instead of time. To the best of our knowledge, our model is the first to provide flexible and coherent prediction on stochastic video datasets, as well as consistent 3D scene samples. Please check the project website https://bit.ly/2jX7Vyu to view scene reconstructions and videos produced by our model.
URL
https://arxiv.org/abs/1807.02033