
MMVP: Motion-Matrix-based Video Prediction

2023-08-30 17:20:46
Yiqi Zhong, Luming Liang, Ilya Zharkov, Ulrich Neumann

Abstract

A central challenge of video prediction lies in reasoning about objects' future motions from image frames while simultaneously maintaining the consistency of their appearances across frames. This work introduces an end-to-end trainable two-stream video prediction framework, Motion-Matrix-based Video Prediction (MMVP), to tackle this challenge. Unlike previous methods that usually handle motion prediction and appearance maintenance within the same set of modules, MMVP decouples motion and appearance information by constructing appearance-agnostic motion matrices. The motion matrices represent the temporal similarity of every pair of feature patches in the input frames and are the sole input of the motion prediction module in MMVP. This design improves video prediction in both accuracy and efficiency while reducing the model size. Results of extensive experiments demonstrate that MMVP outperforms state-of-the-art systems on public datasets by non-negligible margins (about 1 dB in PSNR on UCF Sports) with significantly smaller models (84% of the size or smaller). Please refer to this https URL for the official code and the datasets used in this paper.
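As a rough illustration of the idea, the sketch below builds an appearance-agnostic similarity matrix between the feature patches of two frames: every spatial location of one frame's feature map is compared with every location of the next frame's, and only these similarities (not the appearance features themselves) would be fed to a motion prediction module. The function name, tensor shapes, and the use of cosine similarity are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def motion_matrix(feat_t, feat_t1):
        """Pairwise patch-similarity matrix between two frames' feature maps.

        feat_t, feat_t1: (B, C, H, W) feature maps of frames t and t+1,
        where each spatial location is treated as one feature patch.
        Returns a (B, H*W, H*W) matrix whose entry (i, j) is the cosine
        similarity between patch i of frame t and patch j of frame t+1.
        """
        B, C, H, W = feat_t.shape
        # Flatten the spatial grid into H*W patch vectors per frame and
        # L2-normalize along the channel dimension.
        a = F.normalize(feat_t.reshape(B, C, H * W), dim=1)   # (B, C, N)
        b = F.normalize(feat_t1.reshape(B, C, H * W), dim=1)  # (B, C, N)
        # Dot products of every patch in frame t with every patch in frame t+1.
        return torch.einsum('bcn,bcm->bnm', a, b)             # (B, N, N)

    # Usage (hypothetical feature maps): the resulting matrices carry only
    # temporal correspondence information, decoupled from appearance.
    f_t, f_t1 = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
    M = motion_matrix(f_t, f_t1)   # shape: (2, 256, 256)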

URL

https://arxiv.org/abs/2308.16154

PDF

https://arxiv.org/pdf/2308.16154.pdf

