Fast Fourier Inception Networks for Occluded Video Prediction

2023-06-17 13:27:29
Ping Li, Chenhan Zhang, Xianghua Xu

Abstract

Video prediction is a pixel-level task that generates future frames from historical frames. Videos often contain continuous complex motions, such as object overlapping and scene occlusion, which pose great challenges to this task. Previous works either fail to capture long-term temporal dynamics well or do not handle occlusion masks. To address these issues, we develop the fully convolutional Fast Fourier Inception Networks for video prediction, termed FFINet, which include two primary components, i.e., an occlusion inpainter and a spatiotemporal translator. The former adopts fast Fourier convolutions to enlarge the receptive field, so that missing (occluded) areas with complex geometric structures can be filled in by the inpainter. The latter employs stacked Fourier transform inception modules to learn temporal evolution via group convolutions and spatial movement via channel-wise Fourier convolutions, capturing both local and global spatiotemporal features. This encourages the model to generate more realistic, high-quality future frames. To optimize the model, a recovery loss is added to the objective, i.e., the mean squared error between the ground-truth frame and the recovered frame is minimized. Both quantitative and qualitative experimental results on five benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, demonstrate the superiority of the proposed approach. Our code is available on GitHub.
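Below is a minimal PyTorch sketch of two ideas the abstract names: a channel-wise Fourier convolution (a pointwise convolution applied in the frequency domain, which gives every output location a global receptive field) and the recovery loss. This is an illustrative reading of the abstract, not the authors' released implementation; the layer widths, normalization choices, and names (SpectralTransform, recovery_loss) are assumptions.

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Channel-wise Fourier convolution: a 1x1 convolution applied in the
    frequency domain, so every output pixel sees the whole frame."""

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis,
        # so the pointwise conv mixes 2*channels -> 2*channels.
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        # 2-D real FFT over the spatial dimensions.
        freq = torch.fft.rfft2(x, norm="ortho")          # (B, C, H, W//2+1), complex
        freq = torch.cat([freq.real, freq.imag], dim=1)  # (B, 2C, H, W//2+1), real
        freq = self.conv(freq)                           # mix channels in frequency domain
        real, imag = torch.chunk(freq, 2, dim=1)
        freq = torch.complex(real, imag)
        # Back to the spatial domain at the original resolution.
        return torch.fft.irfft2(freq, s=(h, w), norm="ortho")

def recovery_loss(recovered: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Recovery loss as described in the abstract: mean squared error between
    the ground-truth frame and the frame recovered by the inpainter."""
    return torch.mean((recovered - target) ** 2)

# Quick shape check: a 1x1 spectral conv preserves (B, C, H, W).
layer = SpectralTransform(64)
y = layer(torch.randn(2, 64, 32, 32))
assert y.shape == (2, 64, 32, 32)
```

Because a pointwise product in the frequency domain corresponds to a global operation in the spatial domain, stacking such layers lets the inpainter use context from the entire frame when filling occluded regions, rather than only a local neighborhood.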


URL

https://arxiv.org/abs/2306.10346

PDF

https://arxiv.org/pdf/2306.10346.pdf
