Abstract
Video prediction is a pixel-level task that generates future frames from historical frames. Videos often contain continuous complex motions, such as object overlapping and scene occlusion, which pose great challenges to this task. Previous works either fail to capture long-term temporal dynamics well or do not handle occlusion masks. To address these issues, we develop fully convolutional Fast Fourier Inception Networks for video prediction, termed \textit{FFINet}, which include two primary components, \ie, an occlusion inpainter and a spatiotemporal translator. The former adopts fast Fourier convolutions to enlarge the receptive field, such that missing areas (occlusions) with complex geometric structures are filled by the inpainter. The latter employs a stacked Fourier transform inception module that learns temporal evolution by group convolutions and spatial movement by channel-wise Fourier convolutions, capturing both local and global spatiotemporal features. This encourages the generation of more realistic, high-quality future frames. To optimize the model, a recovery loss is imposed on the objective, \ie, minimizing the mean squared error between the ground-truth frame and the recovered frame. Both quantitative and qualitative experimental results on five benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, demonstrate the superiority of the proposed approach. Our code is available on GitHub.
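The two operations named above — a channel-wise convolution applied in the Fourier domain (which gives an image-wide receptive field in one step) and the mean-squared-error recovery loss — can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the authors' implementation; the function names and shapes are hypothetical.

```python
import numpy as np

def fourier_conv(x, w):
    """Channel-wise convolution in the Fourier domain (illustrative sketch).

    x: (C, H, W) real feature map; w: (C, H, W//2 + 1) per-channel spectral
    weights. Multiplying the spectrum element-wise equals a circular
    convolution in the spatial domain, so every output pixel sees the
    whole frame -- the 'enlarged receptive field' of fast Fourier convs.
    """
    X = np.fft.rfft2(x)                            # per-channel 2-D spectrum
    return np.fft.irfft2(X * w, s=x.shape[-2:])    # back to spatial domain

def recovery_loss(pred, gt):
    """Mean squared error between recovered and ground-truth frames."""
    return np.mean((pred - gt) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))   # toy 4-channel 8x8 feature map
w = np.ones((4, 8, 5))               # unit spectral weights = identity op
y = fourier_conv(x, w)
print(np.allclose(y, x))             # True: unit weights leave x unchanged
```

In the actual model the spectral weights would be learned parameters (and in practice a complex-valued 1x1 convolution over the spectrum); unit weights are used here only to make the round trip verifiable.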
URL
https://arxiv.org/abs/2306.10346