Paper Reading AI Learner

Consistent Generative Query Networks

2018-07-05 14:51:51
Ananya Kumar, S. M. Ali Eslami, Danilo J. Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, Murray Shanahan

Abstract

Stochastic video prediction is usually framed as an extrapolation problem where the goal is to sample a sequence of consecutive future image frames conditioned on a sequence of observed past frames. For the most part, algorithms for this task generate future video frames sequentially in an autoregressive fashion, which is slow and requires the input and output to be consecutive. We introduce a model that overcomes these drawbacks -- it learns to generate a global latent representation from an arbitrary set of frames within a video. This representation can then be used to simultaneously and efficiently sample any number of temporally consistent frames at arbitrary time-points in the video. We apply our model to synthetic video prediction tasks and achieve results that are comparable to state-of-the-art video prediction models. In addition, we demonstrate the flexibility of our model by applying it to 3D scene reconstruction where we condition on location instead of time. To the best of our knowledge, our model is the first to provide flexible and coherent prediction on stochastic video datasets, as well as consistent 3D scene samples. Please check the project website https://bit.ly/2jX7Vyu to view scene reconstructions and videos produced by our model.
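The querying interface the abstract describes can be sketched in a few lines. This is a minimal, hypothetical NumPy sketch under loose assumptions, not the paper's actual architecture: the real model uses learned neural encoders/decoders and a stochastic latent, whereas the toy random weights and functions below (`encode`, `aggregate`, `render` are invented names) only illustrate the two key properties — an order-invariant global latent built from an arbitrary set of observed frames, and simultaneous decoding at arbitrary query time-points from that one shared latent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fixed weights standing in for learned networks (hypothetical).
W_enc = rng.standard_normal((65, 16))  # flattened 8x8 frame + time -> 16-d code
W_dec = rng.standard_normal((17, 64))  # (16-d latent, query time) -> 8x8 frame

def encode(frame, t):
    """Encode one (frame, time) observation into a 16-d vector."""
    x = np.concatenate([frame.ravel(), [t]])
    return np.tanh(x @ W_enc)

def aggregate(codes):
    """Sum-pool observation codes: the global latent is invariant to the
    order of the observed frames, so inputs need not be consecutive."""
    return np.sum(codes, axis=0)

def render(latent, t):
    """Decode the shared latent at an arbitrary query time t."""
    x = np.concatenate([latent, [t]])
    return np.tanh(x @ W_dec).reshape(8, 8)

# Condition on an arbitrary, non-consecutive set of observed frames...
obs = [(rng.standard_normal((8, 8)), t) for t in (0.0, 3.0, 7.0)]
latent = aggregate([encode(f, t) for f, t in obs])

# ...then decode any number of time-points at once. Because every frame
# comes from the same latent, the samples are mutually consistent.
frames = [render(latent, t) for t in (1.5, 2.0, 9.0)]
```

Swapping the scalar time `t` for a camera viewpoint gives the 3D scene-reconstruction variant mentioned in the abstract, where the model conditions on location instead of time.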


URL

https://arxiv.org/abs/1807.02033

PDF

https://arxiv.org/pdf/1807.02033.pdf

