Abstract
Video prediction aims to forecast future frames from a video's preceding content. Existing methods mainly process video data, in which the time dimension mingles with the space and channel dimensions, from one of three distinct angles: as a sequence of individual frames, as a 3D volume in spatiotemporal coordinates, or as a stacked image in which frames are treated as separate channels. Most methods focus on only one of these perspectives and may fail to fully exploit the relationships across dimensions. To address this issue, this paper introduces a convolutional mixer for video prediction, termed ViP-Mixer, which models spatiotemporal evolution in the latent space of an autoencoder. The ViP-Mixers are stacked sequentially and interleave feature mixing at three levels: frames, channels, and locations. Extensive experiments demonstrate that the proposed method achieves new state-of-the-art prediction performance on three benchmark video datasets covering both synthetic and real-world scenarios.
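The three-level mixing described in the abstract can be illustrated with a minimal sketch. This is a hypothetical NumPy toy, not the paper's implementation: the actual ViP-Mixer's layer structure, kernel sizes, normalization, and ordering may differ. Here a latent tensor of shape (frames, channels, height, width) is mixed along each of the three axes in turn, with residual connections.

```python
import numpy as np

def frame_mix(x, w_t):
    # x: (T, C, H, W); w_t: (T, T) — linear mixing across the frame axis
    return np.einsum('st,tchw->schw', w_t, x)

def channel_mix(x, w_c):
    # w_c: (C, C) — pointwise (1x1-conv-like) mixing across channels
    return np.einsum('dc,tchw->tdhw', w_c, x)

def location_mix(x, k):
    # k: (3, 3) — depthwise spatial kernel shared over frames and
    # channels, applied with zero padding so the shape is preserved
    t, c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * xp[:, :, i:i + h, j:j + w]
    return out

def vip_mixer_block(x, w_t, w_c, k):
    # Interleave the three mixing steps with residual connections
    x = x + frame_mix(x, w_t)
    x = x + channel_mix(x, w_c)
    x = x + location_mix(x, k)
    return x
```

Stacking several such blocks on the autoencoder's latent tensor gives each step a chance to exchange information along a different axis, which is the high-level idea the abstract describes.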
URL
https://arxiv.org/abs/2311.11683