Specifying reward signals that allow agents to learn complex behaviors is a long-standing challenge in reinforcement learning. A promising approach is to extract preferences for behaviors from unlabeled videos, which are widely available on the internet. We present Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning. Specifically, we first train an autoregressive transformer on expert videos and then use the video prediction likelihoods as reward signals for a reinforcement learning agent. VIPER enables expert-level control without programmatic task rewards across a wide range of DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction model allows us to derive rewards for an out-of-distribution environment where no expert data is available, enabling cross-embodiment generalization for tabletop manipulation. We see our work as a starting point for scalable reward specification from unlabeled videos that will benefit from the rapid advances in generative modeling. Source code and datasets are available on the project website: this https URL
https://arxiv.org/abs/2305.14343
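A minimal sketch of the reward computation described above, assuming a pretrained autoregressive video model that exposes a `log_prob(next_frame, context)` method (a hypothetical interface, not the authors' API): the agent's reward at each step is the conditional log-likelihood of its observed frame under the expert video prior.

```python
import torch

def viper_style_reward(video_model, frames, context_len=16):
    """Per-step reward = log p(x_t | x_<t) under a video model trained on expert clips.

    frames: (T, C, H, W) tensor of observations collected by the RL agent.
    """
    rewards = []
    for t in range(1, frames.shape[0]):
        context = frames[max(0, t - context_len):t]        # conditioning window
        # hypothetical interface: conditional log-likelihood of the observed frame
        rewards.append(video_model.log_prob(frames[t], context))
    return torch.stack(rewards)                             # (T - 1,) reward signal
```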
In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. Existing approaches typically utilize external information such as semantic maps to enhance video prediction, but often neglect the inherent physical knowledge embedded within videos. Furthermore, their high computational demands can impede their application to high-resolution videos. To address these constraints, we introduce a novel approach called Physics-assisted Spatio-temporal Network (PastNet) for generating high-quality video predictions. The core of our PastNet lies in incorporating a spectral convolution operator in the Fourier domain, which efficiently introduces inductive biases from the underlying physical laws. Additionally, we employ a memory bank with the estimated intrinsic dimensionality to discretize local features during the processing of complex spatio-temporal signals, thereby reducing computational costs and facilitating efficient high-resolution video prediction. Extensive experiments on various widely-used datasets demonstrate the effectiveness and efficiency of the proposed PastNet compared with state-of-the-art methods, particularly in high-resolution scenarios.
https://arxiv.org/abs/2305.11421
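A sketch of a Fourier-domain spectral convolution of the kind the abstract describes, written as a generic operator (the specific filter parameterization in PastNet may differ): features are transformed with an FFT, a learned complex filter mixes channels over the lowest-frequency modes, and an inverse FFT maps back to the spatial domain.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Fourier-domain convolution that keeps only the first `modes` frequencies."""
    def __init__(self, channels, modes=12):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (channels * channels)
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes, dtype=torch.cfloat))

    def forward(self, x):                       # x: (B, C, H, W)
        x_ft = torch.fft.rfft2(x)               # (B, C, H, W//2+1), complex
        out_ft = torch.zeros_like(x_ft)
        m = self.modes
        # mix channels for the retained low-frequency modes only
        out_ft[:, :, :m, :m] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.weight)
        return torch.fft.irfft2(out_ft, s=x.shape[-2:])
```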
Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
https://arxiv.org/abs/2305.11281
Video is a promising source of knowledge for embodied agents to learn models of the world's dynamics. Large deep networks have become increasingly effective at modeling complex video data in a self-supervised manner, as evaluated by metrics based on human perceptual similarity or pixel-wise comparison. However, it remains unclear whether current metrics are accurate indicators of performance on downstream tasks. We find empirically that for planning robotic manipulation, existing metrics can be unreliable at predicting execution success. To address this, we propose a control-centric benchmark for action-conditioned video prediction that evaluates a given model on simulated robotic manipulation through sampling-based planning. Our benchmark, Video Prediction for Visual Planning ($VP^2$), includes simulated environments with 11 task categories and 310 task instance definitions, a full planning implementation, and training datasets containing scripted interaction trajectories for each task category. A central design goal of our benchmark is to expose a simple interface -- a single forward prediction call -- so it is straightforward to evaluate almost any action-conditioned video prediction model. We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling by analyzing five highly-performant video prediction models, finding that while scale can improve perceptual quality when modeling visually diverse settings, other attributes such as uncertainty awareness can also aid planning performance.
https://arxiv.org/abs/2304.13723
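The "single forward prediction call" interface, combined with a basic random-shooting planner, might look like the sketch below; the cost function (pixel distance to a goal image), tensor shapes, and all names are illustrative assumptions rather than the benchmark's actual API.

```python
import numpy as np

def predict(model, context_frames, actions):
    """One forward prediction call (illustrative signature): given context frames
    and a batch of candidate action sequences, return predicted future frames."""
    return model(context_frames, actions)        # assumed (N, horizon, H, W, C)

def plan_random_shooting(model, context_frames, goal_image,
                         horizon=10, n_samples=200, action_dim=4):
    # sample candidate action sequences and score predicted rollouts vs. the goal
    actions = np.random.uniform(-1, 1, size=(n_samples, horizon, action_dim))
    preds = predict(model, context_frames, actions)
    costs = np.mean((preds[:, -1] - goal_image) ** 2, axis=(1, 2, 3))
    return actions[np.argmin(costs)]             # best-scoring action sequence
```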
In this paper, we explore the impact of adding tactile sensation to video prediction models for physical robot interactions. Predicting the impact of robotic actions on the environment is a fundamental challenge in robotics. Current methods leverage visual and robot action data to generate video predictions over a given time period, which can then be used to adjust robot actions. However, humans rely on both visual and tactile feedback to develop and maintain a mental model of their physical surroundings. Motivated by this, we investigate the impact of integrating tactile feedback into video prediction models for physical robot interactions. We propose three multi-modal integration approaches and compare the performance of these tactile-enhanced video prediction models. Additionally, we introduce two new datasets of robot pushing that use a magnetic-based tactile sensor for unsupervised learning. The first dataset contains visually identical objects with different physical properties, while the second dataset mimics existing robot-pushing datasets of household object clusters. Our results demonstrate that incorporating tactile feedback into video prediction models improves scene prediction accuracy and enhances the agent's perception of physical interactions and understanding of cause-effect relationships during physical robot interactions.
https://arxiv.org/abs/2304.11193
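As a rough illustration of one possible integration strategy (early fusion by feature concatenation; the paper's three specific approaches are not reproduced here), a fused visuo-tactile feature could be produced as below, with all dimensions and layer choices being placeholder assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Illustrative early fusion: concatenate visual and tactile features
    before passing them to the video prediction backbone."""
    def __init__(self, visual_dim=256, tactile_dim=64, hidden=256):
        super().__init__()
        self.visual_enc = nn.Linear(visual_dim, hidden)
        self.tactile_enc = nn.Linear(tactile_dim, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, visual_feat, tactile_feat):
        v = torch.relu(self.visual_enc(visual_feat))
        t = torch.relu(self.tactile_enc(tactile_feat))
        return self.fuse(torch.cat([v, t], dim=-1))
```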
The drastic variation of motion in spatial and temporal dimensions makes the video prediction task extremely challenging. Existing RNN models obtain higher performance by deepening or widening the model, and capture multi-scale features of the video only by stacking layers, which is inefficient and incurs prohibitive training costs (memory, FLOPs, and training time). In contrast, this paper proposes MS-LSTM, a spatiotemporal model designed entirely from a multi-scale perspective. On top of stacked layers, MS-LSTM incorporates two additional efficient multi-scale designs to fully capture spatiotemporal context information. Concretely, we employ LSTMs with mirrored pyramid structures to construct spatial multi-scale representations and LSTMs with different convolution kernels to construct temporal multi-scale representations. Detailed comparison experiments with eight baseline models on four video datasets show that MS-LSTM has better performance but lower training costs.
https://arxiv.org/abs/2304.07724
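The temporal multi-scale idea, parallel branches with different convolution kernel sizes whose outputs are merged, can be sketched with plain convolutions as below; MS-LSTM builds this into LSTM cells, which this simplified stand-in does not reproduce.

```python
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    """Parallel convolutions with different kernel sizes capture motion at
    different scales; outputs are summed (a simplification of MS-LSTM's
    multi-scale LSTM branches)."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):                      # x: (B, C, H, W)
        return sum(branch(x) for branch in self.branches)
```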
We present a novel approach for modeling vegetation response to weather in Europe as measured by the Sentinel 2 satellite. Existing satellite imagery forecasting approaches focus on photorealistic quality of the multispectral images, while derived vegetation dynamics have not yet received as much attention. We leverage both spatial and temporal context by extending state-of-the-art video prediction methods with weather guidance. We extend the EarthNet2021 dataset to be suitable for vegetation modeling by introducing a learned cloud mask and an appropriate evaluation scheme. Qualitative and quantitative experiments demonstrate superior performance of our approach over a wide variety of baseline methods, including leading approaches to satellite imagery forecasting. Additionally, we show how our modeled vegetation dynamics can be leveraged in a downstream task: inferring gross primary productivity for carbon monitoring. To the best of our knowledge, this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle, thereby paving the way for predictive assessments of vegetation status.
https://arxiv.org/abs/2303.16198
Imagining the future trajectory is the key for robots to make sound planning and successfully reach their goals. Therefore, text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning, i.e., predicting future video frames with a given language instruction and reference frames. Grounding the task-level goals specified by instructions in high-fidelity future frames is highly challenging and requires large-scale data and computation. To tackle this task and empower robots with the ability to foresee the future, we propose a sample- and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We inflate the denoising U-Net and language conditioning model with two novel techniques, Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer, to propagate the rich prior knowledge in the pretrained T2I models across the frames. With the well-designed architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a few layers on a small amount of data. The experimental results on Something Something V2 (SSv2) and Bridgedata datasets demonstrate our superior video prediction performance with around 210-hour training on 4 RTX 3090 GPUs: decreasing the FVD of the current SOTA model from 290 to 200 on SSv2 and achieving at least 70% preference in the human evaluation.
https://arxiv.org/abs/2303.14897
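Inflating a pretrained text-to-image backbone along the temporal axis essentially means inserting attention layers that operate across the frame dimension while reusing the spatial weights. The module below is a generic sketch of such a temporal attention layer, not Seer's Autoregressive Spatial-Temporal Attention (which additionally imposes a causal structure over frames).

```python
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attention over the frame axis, inserted into a pretrained spatial
    (text-to-image) backbone to 'inflate' it along time."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, T, H*W, C)
        b, t, n, c = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * n, t, c)  # attend across frames
        out, _ = self.attn(x, x, x)
        return out.reshape(b, n, t, c).permute(0, 2, 1, 3)
```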
The performance of video prediction has been greatly boosted by advanced deep neural networks. However, most of the current methods suffer from large model sizes and require extra inputs, e.g., semantic/depth maps, for promising performance. For efficiency, in this paper we propose a Dynamic Multi-scale Voxel Flow Network (DMVFN) that achieves better video prediction performance than previous methods at lower computational cost, using only RGB images. The core of our DMVFN is a differentiable routing module that can effectively perceive the motion scales of video frames. Once trained, our DMVFN selects adaptive sub-networks for different inputs at the inference stage. Experiments on several benchmarks demonstrate that our DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative-based OPT on generated image quality. Our code and demo are available at this https URL.
https://arxiv.org/abs/2303.09875
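A routing module in the spirit described above can be sketched as a small gate that scores several sub-networks and combines their outputs; DMVFN's actual routing selects sub-networks adaptively at inference, whereas this simplified version keeps a soft mixture for clarity. All layer choices are placeholder assumptions.

```python
import torch
import torch.nn as nn

class DynamicRouter(nn.Module):
    """Illustrative differentiable routing: a tiny gate scores each sub-network,
    and the soft weights decide how much each one contributes."""
    def __init__(self, in_channels, num_experts=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, num_experts))
        self.experts = nn.ModuleList(
            nn.Conv2d(in_channels, in_channels, 3, padding=1)
            for _ in range(num_experts))

    def forward(self, x):                                     # x: (B, C, H, W)
        weights = torch.softmax(self.gate(x), dim=-1)         # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, C, H, W)
        return (weights[:, :, None, None, None] * outs).sum(dim=1)
```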
Video prediction is a complex time-series forecasting task with great potential in many use cases. However, conventional methods overemphasize accuracy while ignoring the slow prediction speed caused by complicated model structures that learn too much redundant information with excessive GPU memory consumption. Furthermore, conventional methods mostly predict frames sequentially (frame-by-frame) and thus are hard to accelerate. Consequently, valuable use cases such as real-time danger prediction and warning cannot achieve fast enough inference speed to be applicable in reality. Therefore, we propose a transformer-based keypoint prediction neural network (TKN), an unsupervised learning method that boosts the prediction process via constrained information extraction and a parallel prediction scheme. To the best of our knowledge, TKN is the first real-time video prediction solution; it significantly reduces computation costs while maintaining competitive performance. Extensive experiments on the KTH and Human3.6 datasets demonstrate that TKN predicts 11 times faster than existing methods while reducing memory consumption by 17.4% and achieving state-of-the-art prediction performance on average.
https://arxiv.org/abs/2303.09807
Future frame prediction has been approached through two primary methods: autoregressive and non-autoregressive. Autoregressive methods rely on the Markov assumption and can achieve high accuracy in the early stages of prediction when errors are not yet accumulated. However, their performance tends to decline as the number of time steps increases. In contrast, non-autoregressive methods can achieve relatively high performance but lack correlation between predictions for each time step. In this paper, we propose an Implicit Stacked Autoregressive Model for Video Prediction (IAM4VP), which is an implicit video prediction model that applies a stacked autoregressive method. Like non-autoregressive methods, stacked autoregressive methods use the same observed frame to estimate all future frames. However, they use their own predictions as input, similar to autoregressive methods. As the number of time steps increases, predictions are sequentially stacked in the queue. To evaluate the effectiveness of IAM4VP, we conducted experiments on three common future frame prediction benchmark datasets and weather & climate prediction benchmark datasets. The results demonstrate that our proposed model achieves state-of-the-art performance.
https://arxiv.org/abs/2303.07849
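The stacked autoregressive rollout described above can be sketched as a short loop: every step conditions on the same observed frames plus the queue of the model's own earlier predictions. The model signature below is an assumption made for illustration.

```python
import torch

def stacked_autoregressive_predict(model, observed, horizon=10):
    """IAM4VP-style rollout sketch: each step sees the SAME observed frames
    plus the queue of previously predicted frames (assumed model signature)."""
    queue = []                                         # previously predicted frames
    for _ in range(horizon):
        prev = torch.stack(queue, dim=1) if queue else None
        next_frame = model(observed, prev)             # assumed: returns (B, C, H, W)
        queue.append(next_frame)
    return torch.stack(queue, dim=1)                   # (B, horizon, C, H, W)
```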
Learning physical dynamics in a series of non-stationary environments is a challenging but essential task for model-based reinforcement learning (MBRL) with visual inputs. It requires the agent to consistently adapt to novel tasks without forgetting previous knowledge. In this paper, we present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting. The key assumption is that an ideal world model can provide a non-forgetting environment simulator, which enables the agent to optimize the policy in a multi-task learning manner based on the imagined trajectories from the world model. To this end, we first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting, which we call predictive experience replay. Finally, we extend these methods to continual RL and further address the value estimation problems with the exploratory-conservative behavior learning approach. Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks. It is also shown to effectively alleviate the forgetting of spatiotemporal dynamics in video prediction datasets with evolving domains.
https://arxiv.org/abs/2303.06572
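Predictive experience replay, as the name suggests, rehearses earlier tasks through trajectories imagined by the world model rather than stored data. The sketch below assumes a hypothetical `world_model.imagine(task_id=...)` interface and a simple 50/50 batch mix; the paper's actual procedure is more involved.

```python
import random

def make_training_batch(current_data, world_model, past_task_ids, batch_size=32):
    """Illustrative predictive experience replay: half the batch is real data
    from the current task, half is trajectories imagined by the world model
    for earlier tasks, so old dynamics are rehearsed without storing them."""
    real = random.sample(current_data, batch_size // 2)
    replayed = [world_model.imagine(task_id=random.choice(past_task_ids))  # assumed API
                for _ in range(batch_size // 2)]
    return real + replayed
```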
Motion, scene and object are three primary visual components of a video. In particular, objects represent the foreground, scenes represent the background, and motion traces their dynamics. Based on this insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO) for video prediction, consisting of MOSO-VQVAE and MOSO-Transformer. In the first stage, MOSO-VQVAE decomposes a previous video clip into the motion, scene and object components, and represents them as distinct groups of discrete tokens. Then, in the second stage, MOSO-Transformer predicts the object and scene tokens of the subsequent video clip based on the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens. Our framework can be easily extended to unconditional video generation and video frame interpolation tasks. Experimental results demonstrate that our method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI and UCF101. In addition, MOSO can produce realistic videos by combining objects and scenes from different videos.
https://arxiv.org/abs/2303.03684
We propose a novel framework for the task of object-centric video prediction, i.e., extracting the compositional structure of a video sequence and modeling object dynamics and interactions from visual observations in order to predict the future object states, from which we can then generate subsequent video frames. With the goal of learning meaningful spatio-temporal object representations and accurately forecasting object states, we propose two novel object-centric video predictor (OCVP) transformer modules, which decouple the processing of temporal dynamics and object interactions, thus improving prediction performance. In our experiments, we show how our object-centric prediction framework utilizing our OCVP predictors outperforms object-agnostic video prediction models on two different datasets, while maintaining consistent and accurate object representations.
https://arxiv.org/abs/2302.11850
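Decoupling object interactions from temporal dynamics amounts to factorizing attention over the slot axis and the time axis. The block below is a minimal sketch of that factorization, not the exact OCVP module.

```python
import torch.nn as nn

class DecoupledSlotBlock(nn.Module):
    """Sketch of decoupled attention: one layer relates objects within a time
    step, a second relates each object's states across time steps."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.object_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, slots):                        # slots: (B, T, N_slots, D)
        b, t, n, d = slots.shape
        x = slots.reshape(b * t, n, d)               # attend across objects
        x, _ = self.object_attn(x, x, x)
        x = x.reshape(b, t, n, d).transpose(1, 2).reshape(b * n, t, d)
        x, _ = self.temporal_attn(x, x, x)           # attend across time
        return x.reshape(b, n, t, d).transpose(1, 2)
```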
Prediction of the dynamic environment is crucial to safe navigation of an autonomous vehicle. Urban traffic scenes are particularly challenging to forecast due to complex interactions between various dynamic agents, such as vehicles and vulnerable road users. Previous approaches have used egocentric occupancy grid maps to represent and predict dynamic environments. However, these predictions suffer from blurriness, loss of scene structure at turns, and vanishing of agents over longer prediction horizons. In this work, we propose a novel framework to make long-term predictions by representing the traffic scene in a fixed frame, referred to as an allo-centric occupancy grid. This allows the static scene to remain fixed and the motion of the ego-vehicle to be represented on the grid in the same way as other agents'. We study the allo-centric grid prediction with different video prediction networks and validate the approach on the real-world nuScenes dataset. The results demonstrate that the allo-centric grid representation significantly improves scene prediction, in comparison to the conventional ego-centric grid approach.
https://arxiv.org/abs/2301.04454
The task of video prediction and generation is known to be notoriously difficult, with the research in this area largely limited to short-term predictions. Though plagued with noise and stochasticity, videos consist of features that are organised in a spatiotemporal hierarchy, different features possessing different temporal dynamics. In this paper, we introduce Dynamic Latent Hierarchy (DLH) -- a deep hierarchical latent model that represents videos as a hierarchy of latent states that evolve over separate and fluid timescales. Each latent state is a mixture distribution with two components, representing the immediate past and the predicted future, causing the model to learn transitions only between sufficiently dissimilar states, while clustering temporally persistent states closer together. Using this unique property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modeling temporal dynamics of a video, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, is able to better represent stochasticity, as well as to dynamically adjust its hierarchical and temporal structure. Our paper shows, among other things, how progress in representation learning can translate into progress in prediction tasks.
https://arxiv.org/abs/2212.14376
We introduce a multi-scale predictive model for video prediction whose design is inspired by "Predictive Coding" theories and the "Coarse to Fine" approach. As a predictive coding model, it is updated by a combination of bottom-up and top-down information flows, which differs from the traditional bottom-up training style. This reduces the dependence on input information and improves the model's ability to predict and generate images. Importantly, we achieve this with a multi-scale approach -- higher-level neurons generate coarser predictions (lower resolution), while lower levels generate finer predictions (higher resolution). This is different from the traditional predictive coding framework, in which higher levels predict the activity of neurons in lower levels. To improve the predictive ability, we integrate an encoder-decoder network into the LSTM architecture and share the final encoded high-level semantic information between different levels. Additionally, since the output of each network level is an RGB image, a smaller LSTM hidden state can be used to retain and update only the necessary hidden information, avoiding being mapped to an overly discrete and complex space. In this way, we can reduce the difficulty of prediction and the computational overhead. Finally, we further explore training strategies to address the instability in adversarial training and the mismatch between training and testing in long-term prediction. Code is available at this https URL.
https://arxiv.org/abs/2212.11642
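The coarse-to-fine scheme, where higher levels emit lower-resolution predictions that lower levels refine, can be sketched as below; the per-level LSTM encoder-decoders of the paper are replaced here with placeholder convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineRefiner(nn.Module):
    """Each level upsamples the coarser prediction from the level above and
    refines it together with its own finer-resolution prediction."""
    def __init__(self, channels=3, levels=3):
        super().__init__()
        self.refiners = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, 3, padding=1)
            for _ in range(levels - 1))

    def forward(self, preds):            # list of (B, C, H_l, W_l), coarsest first
        out = preds[0]
        for refine, finer in zip(self.refiners, preds[1:]):
            up = F.interpolate(out, size=finer.shape[-2:],
                               mode="bilinear", align_corners=False)
            out = refine(torch.cat([up, finer], dim=1))
        return out                       # finest-resolution prediction
```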
Video prediction is a challenging computer vision task that has a wide range of applications. In this work, we present a new family of Transformer-based models for video prediction. Firstly, an efficient local spatial-temporal separation attention mechanism is proposed to reduce the complexity of standard Transformers. Then, a full autoregressive model, a partial autoregressive model and a non-autoregressive model are developed based on the new efficient Transformer. The partial autoregressive model achieves performance similar to the full autoregressive model but with faster inference. The non-autoregressive model not only achieves a faster inference speed but also mitigates the quality degradation problem of its autoregressive counterparts, though it requires additional parameters and an additional loss function for learning. Given the same attention mechanism, we conducted a comprehensive study to compare the proposed three video prediction variants. Experiments show that the proposed video prediction models are competitive with more complex state-of-the-art convolutional-LSTM based models. The source code is available at this https URL.
https://arxiv.org/abs/2212.06026
The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, a direct inference of per-pixel intensity for the next visual frame from the latent codes is extremely challenging because of the high-dimensional image space. To this end, we propose to decouple audio-visual conditioned video prediction into motion and appearance modeling. The first part is the multimodal motion estimation module that learns motion information as optical flow from the given audio-visual clip. The second part is the context-aware refinement module that uses the predicted optical flow to warp the current visual frame into the next visual frame and refines it based on the given audio-visual context. Experimental results show that our method achieves competitive results on existing benchmarks.
https://arxiv.org/abs/2212.04679
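The motion-then-appearance decomposition hinges on warping the current frame with the predicted optical flow before refinement. Below is a standard flow-warping sketch using `grid_sample`; the paper's multimodal flow estimator and context-aware refinement module are omitted.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """Warp the current frame toward the next one using a predicted optical
    flow field of shape (B, 2, H, W); refinement is not included here."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)      # (2, H, W) pixel coords
    new = grid[None] + flow                                    # displaced coordinates
    # normalize to [-1, 1] as required by grid_sample
    new_x = 2 * new[:, 0] / (w - 1) - 1
    new_y = 2 * new[:, 1] / (h - 1) - 1
    sample_grid = torch.stack((new_x, new_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(frame, sample_grid, align_corners=True)
```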
The mainstream of the existing approaches for video prediction builds up their models based on a Single-In-Single-Out (SISO) architecture, which takes the current frame as input to predict the next frame in a recursive manner. This often leads to severe performance degradation when extrapolating a longer period of the future, thus limiting the practical use of the prediction model. Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all the future frames at one shot naturally breaks the recursive manner and therefore prevents error accumulation. However, only a few MIMO models for video prediction have been proposed, and to date they have only achieved inferior performance. The real strength of the MIMO model in this area is not well noticed and is largely under-explored. Motivated by that, we conduct a comprehensive investigation in this paper to thoroughly exploit how far a simple MIMO architecture can go. Surprisingly, our empirical studies reveal that a simple MIMO model can outperform the state-of-the-art work by a large margin, much more than expected, especially in dealing with long-term error accumulation. After exploring a number of ways and designs, we propose a new MIMO architecture based on extending the pure Transformer with local spatio-temporal blocks and a new multi-output decoder, namely MIMO-VP, to establish a new standard in video prediction. We evaluate our model on four highly competitive benchmarks (Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments show that our model wins 1st place on all the benchmarks with remarkable performance gains and surpasses the best SISO model in all aspects including efficiency, quantity, and quality. We believe our model can serve as a new baseline to facilitate the future research of video prediction tasks. The code will be released.
https://arxiv.org/abs/2212.04655
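The SISO vs. MIMO distinction boils down to how the rollout is produced: recursively one frame at a time, or all future frames in a single forward pass. The two hypothetical model interfaces below illustrate the difference.

```python
import torch

def rollout_siso(siso_model, context, horizon):
    """Single-In-Single-Out: feed each new prediction back in; errors accumulate."""
    frames = list(context.unbind(dim=1))                 # context: (B, T_ctx, C, H, W)
    for _ in range(horizon):
        window = torch.stack(frames[-context.shape[1]:], dim=1)
        frames.append(siso_model(window))                # assumed: returns (B, C, H, W)
    return torch.stack(frames[-horizon:], dim=1)

def rollout_mimo(mimo_model, context, horizon):
    """Multi-In-Multi-Out: one forward pass emits the whole future at once."""
    return mimo_model(context, horizon)                  # assumed: (B, horizon, C, H, W)
```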