
Implicit Stacked Autoregressive Model for Video Prediction

2023-03-14 12:41:56
Minseok Seo, Hakjin Lee, Doyi Kim, Junghoon Seo

Abstract

Future frame prediction has been approached through two primary methods: autoregressive and non-autoregressive. Autoregressive methods rely on the Markov assumption and can achieve high accuracy in the early stages of prediction when errors are not yet accumulated. However, their performance tends to decline as the number of time steps increases. In contrast, non-autoregressive methods can achieve relatively high performance but lack correlation between predictions for each time step. In this paper, we propose an Implicit Stacked Autoregressive Model for Video Prediction (IAM4VP), which is an implicit video prediction model that applies a stacked autoregressive method. Like non-autoregressive methods, stacked autoregressive methods use the same observed frame to estimate all future frames. However, they use their own predictions as input, similar to autoregressive methods. As the number of time steps increases, predictions are sequentially stacked in the queue. To evaluate the effectiveness of IAM4VP, we conducted experiments on three common future frame prediction benchmark datasets and weather & climate prediction benchmark datasets. The results demonstrate that our proposed model achieves state-of-the-art performance.
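Based only on the abstract, the stacked autoregressive rollout it describes can be sketched as follows: the fixed observed clip is reused at every step, while the model's own earlier predictions are stacked in a queue and fed back as additional input. This is a minimal illustrative sketch, not the paper's actual architecture; the `model` interface (taking the observed clip, the prediction queue, and the target time-step index) and all tensor shapes are assumptions made for the example.

```python
import torch
import torch.nn as nn


def stacked_autoregressive_predict(model: nn.Module,
                                   observed: torch.Tensor,
                                   num_future: int) -> torch.Tensor:
    """Roll out `num_future` frames with a stacked autoregressive loop.

    `observed` has shape (B, T_obs, C, H, W). `model` is a hypothetical
    frame predictor returning one frame of shape (B, C, H, W).
    """
    predictions = []  # queue of the model's own outputs, grown each step
    for t in range(num_future):
        # Unlike a purely autoregressive model, the observed frames are
        # reused unchanged at every step; unlike a non-autoregressive
        # model, the previously predicted frames are fed back as input.
        queue = torch.stack(predictions, dim=1) if predictions else None
        next_frame = model(observed, queue, t)        # (B, C, H, W)
        predictions.append(next_frame)
    return torch.stack(predictions, dim=1)            # (B, num_future, C, H, W)


# Toy usage: a stand-in predictor that simply echoes the latest frame,
# in place of the learned network described in the paper.
class LastFrameCopy(nn.Module):
    def forward(self, observed, queue, t):
        return observed[:, -1] if queue is None else queue[:, -1]


obs = torch.randn(2, 10, 1, 64, 64)                   # (B, T_obs, C, H, W)
future = stacked_autoregressive_predict(LastFrameCopy(), obs, num_future=10)
print(future.shape)                                   # torch.Size([2, 10, 1, 64, 64])
```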


URL

https://arxiv.org/abs/2303.07849

PDF

https://arxiv.org/pdf/2303.07849.pdf

