Abstract
Our work explores the task of generating future sensor observations conditioned on the past. We are motivated by "predictive coding" concepts from neuroscience as well as robotic applications such as self-driving vehicles. Predictive video modeling is challenging because the future may be multi-modal and learning at scale remains computationally expensive for video processing. To address both challenges, our key insight is to leverage the large-scale pretraining of image diffusion models, which can handle multi-modality. We repurpose image models for video prediction by conditioning on new frame timestamps. Such models can be trained with videos of both static and dynamic scenes. To allow them to be trained with modestly sized datasets, we introduce invariances that factor out illumination and texture by forcing the model to predict (pseudo) depth, which is readily obtained for in-the-wild videos via off-the-shelf monocular depth networks. In fact, we show that simply modifying networks to predict grayscale pixels already improves the accuracy of video prediction. Given the extra controllability with timestamp conditioning, we propose sampling schedules that work better than the traditional autoregressive and hierarchical sampling strategies. Motivated by probabilistic metrics from the object forecasting literature, we create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes and a large vocabulary of objects. Our experiments illustrate the effectiveness of learning to condition on timestamps, and show the importance of predicting the future with invariant modalities.
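The two traditional strategies named in the abstract can be pictured as different orderings over future frame offsets. The following is a minimal sketch for illustration only: it is not the sampling schedule the paper proposes, the function names are hypothetical, and the midpoint-bisection rule used for the hierarchical case is an assumption about one common way such schedules are built. No diffusion model is invoked; the code only lists the order in which timestamps would be requested.

```python
from typing import List


def autoregressive_order(num_future: int) -> List[int]:
    """Visit future frame offsets one step at a time: 1, 2, ..., num_future."""
    if num_future < 1:
        return []
    return list(range(1, num_future + 1))


def hierarchical_order(num_future: int) -> List[int]:
    """Visit the farthest offset first, then repeatedly fill in midpoints.

    Offset 0 stands for the last observed frame (an illustrative convention).
    """
    if num_future < 1:
        return []
    order = [num_future]
    # Each interval (lo, hi) has both endpoints already available.
    intervals = [(0, num_future)]
    while intervals:
        next_intervals = []
        for lo, hi in intervals:
            mid = (lo + hi) // 2
            if mid == lo:
                # No unvisited offset lies strictly between lo and hi.
                continue
            order.append(mid)
            next_intervals.extend([(lo, mid), (mid, hi)])
        intervals = next_intervals
    return order


if __name__ == "__main__":
    print(autoregressive_order(8))
    print(hierarchical_order(8))
```

With timestamp conditioning, each offset in such a list could in principle be requested directly from the model, which is what makes alternative orderings possible to explore.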
URL
https://arxiv.org/abs/2404.11554