A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

2023-03-17 10:42:05
Xiaotao Hu, Zhewei Huang, Ailin Huang, Jun Xu, Shuchang Zhou

Abstract

The performance of video prediction has been greatly boosted by advanced deep neural networks. However, most current methods suffer from large model sizes and require extra inputs, e.g., semantic or depth maps, to achieve promising performance. To improve efficiency, we propose a Dynamic Multi-scale Voxel Flow Network (DMVFN) that achieves better video prediction performance than previous methods at lower computational cost, using only RGB images. The core of DMVFN is a differentiable routing module that effectively perceives the motion scales of video frames. Once trained, DMVFN selects adaptive sub-networks for different inputs at the inference stage. Experiments on several benchmarks demonstrate that DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative-based OPT on generated image quality. Our code and demo are available at this https URL.
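
The idea of selecting sub-networks per input can be illustrated with a toy sketch. This is not the authors' implementation: the abstract describes a learned, differentiable routing module, whereas the sketch below swaps in a fixed threshold rule on a crude motion proxy. All names (`motion_magnitude`, `route`, `run_selected`), the threshold values, and the scalar "blocks" are hypothetical and exist only to show how a per-input gate vector lets inference skip blocks when motion is small.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # toy "frame": a flat list of pixel intensities


def motion_magnitude(prev: Frame, curr: Frame) -> float:
    """Mean absolute difference between two frames (a crude motion proxy)."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(prev)


def route(prev: Frame, curr: Frame, thresholds: Sequence[float]) -> List[int]:
    """Hypothetical router: enable block i when motion reaches thresholds[i].

    Assumption: a hand-set threshold rule stands in for the learned,
    differentiable routing module mentioned in the abstract.
    """
    m = motion_magnitude(prev, curr)
    return [1 if m >= t else 0 for t in thresholds]


def run_selected(
    x: float,
    blocks: Sequence[Callable[[float], float]],
    gates: Sequence[int],
) -> Tuple[float, int]:
    """Apply only the gated-on blocks; return the output and the number executed."""
    executed = 0
    for block, gate in zip(blocks, gates):
        if gate:
            x = block(x)
            executed += 1
    return x, executed


# Toy usage: three identical "refinement" blocks, made-up thresholds.
blocks = [lambda v: v + 1.0, lambda v: v + 1.0, lambda v: v + 1.0]
thresholds = [0.0, 0.1, 0.5]

prev = [0.0, 0.0, 0.0, 0.0]
small_motion = [0.1, 0.0, 0.1, 0.0]
large_motion = [1.0, 1.0, 1.0, 1.0]

gates_small = route(prev, small_motion, thresholds)
gates_large = route(prev, large_motion, thresholds)
out_small, n_small = run_selected(0.0, blocks, gates_small)
out_large, n_large = run_selected(0.0, blocks, gates_large)
```

In this toy setup the small-motion pair runs fewer blocks than the large-motion pair, which is the source of the compute savings that input-adaptive sub-network selection aims for.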

URL

https://arxiv.org/abs/2303.09875

PDF

https://arxiv.org/pdf/2303.09875.pdf

