Paper Reading AI Learner

Predicting Long-horizon Futures by Conditioning on Geometry and Time

2024-04-17 16:56:31
Tarasha Khurana, Deva Ramanan

Abstract

Our work explores the task of generating future sensor observations conditioned on the past. We are motivated by 'predictive coding' concepts from neuroscience as well as robotic applications such as self-driving vehicles. Predictive video modeling is challenging because the future may be multi-modal and learning at scale remains computationally expensive for video processing. To address both challenges, our key insight is to leverage the large-scale pretraining of image diffusion models, which can handle multi-modality. We repurpose image models for video prediction by conditioning on new frame timestamps. Such models can be trained with videos of both static and dynamic scenes. To allow them to be trained with modestly sized datasets, we introduce invariances by factoring out illumination and texture, forcing the model to predict (pseudo) depth, readily obtained for in-the-wild videos via off-the-shelf monocular depth networks. In fact, we show that simply modifying networks to predict grayscale pixels already improves the accuracy of video prediction. Given the extra controllability of timestamp conditioning, we propose sampling schedules that work better than the traditional autoregressive and hierarchical sampling strategies. Motivated by probabilistic metrics from the object forecasting literature, we create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes and a large vocabulary of objects. Our experiments illustrate the effectiveness of learning to condition on timestamps, and show the importance of predicting the future with invariant modalities.
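The abstract's core idea is conditioning an image diffusion model on the timestamp of the frame to be predicted. The paper does not specify the conditioning mechanism; a common choice for continuous scalars in diffusion models is a sinusoidal embedding, as used for denoising timesteps. The sketch below is a hypothetical illustration of that choice, not the authors' implementation; the function name and dimensions are assumptions.

```python
import numpy as np

def timestamp_embedding(t, dim=128, max_period=10000.0):
    """Sinusoidal embedding of a (possibly fractional) frame timestamp t.

    Hypothetical sketch: embeds the future-frame offset the same way
    diffusion models embed their denoising step, so the network can be
    conditioned on "how far into the future" it should predict.
    """
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down to 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    # Concatenate cos and sin components into a dim-sized vector.
    return np.concatenate([np.cos(args), np.sin(args)])
```

Such an embedding would typically be added to the diffusion network's existing step embedding, letting one pretrained image model predict frames at arbitrary future offsets.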
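The abstract contrasts the proposed timestamp-conditioned sampling schedules with traditional autoregressive and hierarchical strategies. As background, the two baselines can be sketched as orderings over future frame indices; the exact schedules the paper proposes are not given in the abstract, so only the standard baselines are shown here.

```python
def autoregressive_schedule(horizon):
    """Predict t+1, t+2, ..., each conditioned on previously
    generated frames (errors can compound over the horizon)."""
    return list(range(1, horizon + 1))

def hierarchical_schedule(horizon):
    """Coarse-to-fine: predict the farthest frame first, then
    recursively fill midpoints between already-known frames."""
    order = [horizon]
    def fill(lo, hi):
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        order.append(mid)
        fill(lo, mid)
        fill(mid, hi)
    fill(0, horizon)
    return order
```

A model conditioned directly on timestamps can instead sample any subset of future frames in any order, which is the controllability the abstract exploits.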

URL

https://arxiv.org/abs/2404.11554

PDF

https://arxiv.org/pdf/2404.11554.pdf

