ViP-Mixer: A Convolutional Mixer for Video Prediction

2023-11-20 11:28:18
Xin Zheng, Ziang Peng, Yuan Cao, Hongming Shan, Junping Zhang

Abstract

Video prediction aims to predict future frames from a video's previous content. Existing methods typically treat video data, in which the time dimension mingles with the space and channel dimensions, from one of three perspectives: as a sequence of individual frames, as a 3D volume in spatiotemporal coordinates, or as a stacked image in which frames are treated as separate channels. Most focus on only one of these perspectives and may fail to fully exploit the relationships across different dimensions. To address this issue, this paper introduces a convolutional mixer for video prediction, termed ViP-Mixer, to model the spatiotemporal evolution in the latent space of an autoencoder. The ViP-Mixers are stacked sequentially and interleave feature mixing at three levels: frames, channels, and locations. Extensive experiments demonstrate that the proposed method achieves new state-of-the-art prediction performance on three benchmark video datasets covering both synthetic and real-world scenarios.
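
As a rough illustration of the three perspectives named in the abstract, the sketch below lists the tensor shape each one implies for a batch of videos. The (batch, frames, channels, height, width) source layout, the function name video_views, the dictionary keys, and the example sizes are all assumptions made for illustration; none of them are taken from the paper, and this is not the ViP-Mixer architecture itself.

```python
def video_views(batch, frames, channels, height, width):
    """Return the tensor shape implied by each common view of a video batch.

    Assumes (hypothetically) that the source layout is
    (batch, frames, channels, height, width).
    """
    return {
        # Sequence of individual frames: time folds into the batch axis,
        # so a 2D model processes each frame on its own.
        "frame_sequence": (batch * frames, channels, height, width),
        # 3D volume: time sits next to height and width as a third
        # coordinate axis, as a 3D convolution would expect.
        "spatiotemporal_volume": (batch, channels, frames, height, width),
        # Stacked image: time folds into the channel axis, so frames
        # behave like extra channels of a single image.
        "stacked_channels": (batch, frames * channels, height, width),
    }


# Example sizes are arbitrary and chosen only for the demonstration.
views = video_views(batch=2, frames=10, channels=1, height=64, width=64)
for name, shape in views.items():
    print(name, shape)
```

Every view holds the same number of values; only the axis that absorbs time changes, which is why a model built around a single view mixes information along some dimensions more readily than others.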

URL

https://arxiv.org/abs/2311.11683

PDF

https://arxiv.org/pdf/2311.11683.pdf

