We address the video prediction task by putting forth a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE) and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By combining HR-VQVAE's intrinsic capability to model still images with a parsimonious representation and ST-PixelCNN's ability to handle spatiotemporal information, S-HR-VQVAE can better address the chief challenges in video prediction: learning spatiotemporal information, handling high-dimensional data, combating blurry predictions, and implicitly modeling physical characteristics. Extensive experimental results on the KTH Human Action and Moving-MNIST tasks demonstrate that our model compares favorably against top video prediction techniques in both quantitative and qualitative evaluations, despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and ST-PixelCNN parameters.
https://arxiv.org/abs/2307.06701
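The residual quantization idea behind HR-VQVAE can be sketched in a few lines of NumPy. This is our own toy illustration, not the authors' code: each codebook in the hierarchy quantizes the residual the previous level could not represent, so successive codebooks capture progressively finer detail.

```python
import numpy as np

def quantize(vecs, codebook):
    """Map each row of `vecs` to its nearest codebook entry (squared L2 distance)."""
    dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def hierarchical_residual_quantize(z, codebooks):
    """Each level quantizes the residual left over by the previous level."""
    recon = np.zeros_like(z)
    residual = z
    indices = []
    for cb in codebooks:
        q, idx = quantize(residual, cb)
        recon = recon + q          # accumulate coarse-to-fine reconstruction
        residual = residual - q    # pass what is left to the next codebook
        indices.append(idx)
    return recon, indices
```

The discrete index arrays (one per level) are what a prior such as ST-PixelCNN would then model over space and time.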
With the increasing adoption of robots across industries, it is crucial to develop advanced algorithms that enable robots to anticipate, comprehend, and plan their actions effectively in collaboration with humans. We introduce the Robot Autonomous Motion (RoAM) video dataset, collected with a custom-made turtlebot3 Burger robot in a variety of indoor environments, recording various human motions from the robot's ego-vision. The dataset also includes synchronized records of the LiDAR scan and all control actions taken by the robot as it navigates around static and moving human agents. This unique dataset provides an opportunity to develop and benchmark new visual prediction frameworks that predict future image frames based on the actions taken by the recording agent, in partially observable scenarios or cases where the imaging sensor is mounted on a moving platform. We benchmark the dataset on our novel deep visual prediction framework, ACPNet, in which the approximated future image frames are also conditioned on the actions taken by the robot, and demonstrate its potential for incorporating robot dynamics into the video prediction paradigm for mobile robotics and autonomous navigation research.
https://arxiv.org/abs/2306.15852
General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recent years, benchmarks to test their performance typically do not require an understanding that objects have individual physical properties, or at best test only those properties that are directly observable (e.g., size or color). This work proposes a novel dataset and benchmark, termed Physion++, that rigorously evaluates visual physical prediction in artificial systems under circumstances where those predictions rely on accurate estimates of the latent physical properties of objects in the scene. Specifically, we test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability, and where the values of those properties can only be inferred by observing how objects move and interact with other objects or fluids. We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions. We find that models that have been trained using standard regimes and datasets do not spontaneously learn to make inferences about latent properties, but also that models that encode objectness and physical states tend to make better predictions. However, there is still a huge gap between all models and human performance, and all models' predictions correlate poorly with those made by humans, suggesting that no state-of-the-art model is learning to make physical predictions in a human-like way. Project page: this https URL
https://arxiv.org/abs/2306.15668
In recent years, deep learning-based solar forecasting using all-sky images has emerged as a promising approach for alleviating uncertainty in PV power generation. However, the stochastic nature of cloud movement remains a major challenge for accurate and reliable solar forecasting. With the recent advances in generative artificial intelligence, the synthesis of visually plausible yet diversified sky videos has potential for aiding in forecasts. In this study, we introduce \emph{SkyGPT}, a physics-informed stochastic video prediction model that is able to generate multiple possible future images of the sky with diverse cloud motion patterns, using past sky image sequences as input. Extensive experiments and comparison with benchmark video prediction models demonstrate the effectiveness of the proposed model in capturing cloud dynamics and generating future sky images with high realism and diversity. Furthermore, we feed the generated future sky images from the video prediction models into 15-minute-ahead probabilistic solar forecasting for a 30-kW roof-top PV system, and compare it with an end-to-end deep learning baseline model, SUNSET, and a smart persistence model. Better PV output prediction reliability and sharpness are observed by using the predicted sky images generated with SkyGPT compared with other benchmark models, achieving a continuous ranked probability score (CRPS) of 2.81 (13\% better than SUNSET and 23\% better than smart persistence) and a Winkler score of 26.70 for the test set. Although an arbitrary number of futures can be generated from a historical sky image sequence, the results suggest that 10 future scenarios is a good choice for balancing probabilistic solar forecasting performance and computational cost.
https://arxiv.org/abs/2306.11682
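The CRPS used to score these probabilistic forecasts has a simple closed form when the forecast is an ensemble of generated scenarios. The sketch below uses the standard ensemble estimator (our own illustration, not the paper's evaluation code); note it reduces to plain absolute error when all members agree:

```python
import numpy as np

def crps_ensemble(members, observation):
    """CRPS estimated from ensemble members (lower is better):
    mean |x_i - y|  -  0.5 * mean |x_i - x_j|.
    The first term rewards accuracy, the second rewards calibrated spread."""
    m = np.asarray(members, dtype=float)
    accuracy = np.abs(m - observation).mean()
    spread = np.abs(m[:, None] - m[None, :]).mean()
    return accuracy - 0.5 * spread
```

Averaging this score over all test timestamps gives a single CRPS figure such as the 2.81 reported above.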
Video prediction is a pixel-level task that generates future frames from historical frames. Videos often contain continuous complex motions, such as object overlapping and scene occlusion, which pose great challenges to this task. Previous works either fail to capture long-term temporal dynamics well or do not handle occlusion masks. To address these issues, we develop the fully convolutional Fast Fourier Inception Networks for video prediction, termed \textit{FFINet}, which includes two primary components, i.e., the occlusion inpainter and the spatiotemporal translator. The former adopts fast Fourier convolutions to enlarge the receptive field, such that the missing areas (occlusion) with complex geometric structures are filled by the inpainter. The latter employs the stacked Fourier transform inception module to learn the temporal evolution by group convolutions and the spatial movement by channel-wise Fourier convolutions, capturing both local and global spatiotemporal features. This encourages generating more realistic and high-quality future frames. To optimize the model, a recovery loss is imposed on the objective, i.e., minimizing the mean squared error between the ground-truth frame and the recovered frame. Both quantitative and qualitative experimental results on five benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, demonstrate the superiority of the proposed approach. Our code is available at GitHub.
https://arxiv.org/abs/2306.10346
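The appeal of Fourier convolutions is that a pointwise multiply in the frequency domain equals a circular convolution whose receptive field spans the entire input. A minimal 1-D NumPy illustration (ours, not FFINet's actual operator, which learns the spectral weights and works on 2-D feature maps):

```python
import numpy as np

def spectral_conv1d(x, freq_weights):
    """Multiply the real FFT of `x` by per-frequency weights and invert.
    Equivalent to a circular convolution with a global receptive field."""
    return np.fft.irfft(np.fft.rfft(x) * freq_weights, n=len(x))
```

Identity weights recover the input unchanged, and the FFT of a shifted delta realizes a circular shift, which is why a single spectral layer can move information across the whole frame.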
We propose a new object-centric video prediction algorithm based on the deep latent particle (DLP) representation. In comparison to existing slot- or patch-based representations, DLPs model the scene using a set of keypoints with learned parameters for properties such as position and size, and are both efficient and interpretable. Our method, deep dynamic latent particles (DDLP), yields state-of-the-art object-centric video prediction results on several challenging datasets. The interpretable nature of DDLP allows us to perform ``what-if'' generation -- predict the consequence of changing properties of objects in the initial frames, and DLP's compact structure enables efficient diffusion-based unconditional video generation. Videos, code and pre-trained models are available: this https URL
https://arxiv.org/abs/2306.05957
Diffusion models have emerged as a powerful paradigm in video synthesis tasks including prediction, generation, and interpolation. Due to the limitation of the computational budget, existing methods usually implement conditional diffusion models with an autoregressive inference pipeline, in which the future fragment is predicted based on the distribution of adjacent past frames. However, conditioning on only a few previous frames cannot capture global temporal coherence, leading to inconsistent or even implausible results in long-term video prediction. In this paper, we propose a Local-Global Context guided Video Diffusion model (LGC-VD) to capture multi-perception conditions for producing high-quality videos in both conditional and unconditional settings. In LGC-VD, the UNet is implemented with stacked residual blocks with self-attention units, avoiding the undesirable computational cost of 3D convolutions. We construct a local-global context guidance strategy to capture the multi-perceptual embedding of the past fragment and boost the consistency of future prediction. Furthermore, we propose a two-stage training strategy to alleviate the effect of noisy frames and yield more stable predictions. Our experiments demonstrate that the proposed method achieves favorable performance on video prediction, interpolation, and unconditional video generation. We release code at this https URL.
https://arxiv.org/abs/2306.02562
Specifying reward signals that allow agents to learn complex behaviors is a long-standing challenge in reinforcement learning. A promising approach is to extract preferences for behaviors from unlabeled videos, which are widely available on the internet. We present Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning. Specifically, we first train an autoregressive transformer on expert videos and then use the video prediction likelihoods as reward signals for a reinforcement learning agent. VIPER enables expert-level control without programmatic task rewards across a wide range of DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction model allows us to derive rewards for an out-of-distribution environment where no expert data is available, enabling cross-embodiment generalization for tabletop manipulation. We see our work as a starting point for scalable reward specification from unlabeled videos, one that will benefit from the rapid advances in generative modeling. Source code and datasets are available on the project website: this https URL
https://arxiv.org/abs/2305.14343
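The core reward computation in a VIPER-style setup is just the log-likelihood the frozen video model assigns to what the agent actually observed next. A toy version over a discrete token vocabulary (our own sketch; the paper uses an autoregressive transformer over video tokens, not this standalone function):

```python
import numpy as np

def likelihood_reward(logits, observed_token):
    """Reward = log-probability of the observed next token under the
    pretrained video model (a numerically stable log-softmax lookup)."""
    z = logits - logits.max()               # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum()) # log-softmax
    return log_probs[observed_token]
```

Trajectories the video model finds likely (i.e., expert-like) receive higher reward, which is the whole action-free signal.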
In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. Existing approaches typically utilize external information such as semantic maps to enhance video prediction, often neglecting the inherent physical knowledge embedded within videos. Furthermore, their high computational demands could impede their application to high-resolution videos. To address these constraints, we introduce a novel approach called Physics-assisted Spatio-temporal Network (PastNet) for generating high-quality video predictions. The core of PastNet lies in incorporating a spectral convolution operator in the Fourier domain, which efficiently introduces inductive biases from the underlying physical laws. Additionally, we employ a memory bank with the estimated intrinsic dimensionality to discretize local features during the processing of complex spatio-temporal signals, thereby reducing computational costs and facilitating efficient high-resolution video prediction. Extensive experiments on various widely-used datasets demonstrate the effectiveness and efficiency of the proposed PastNet compared with state-of-the-art methods, particularly in high-resolution scenarios.
https://arxiv.org/abs/2305.11421
Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
https://arxiv.org/abs/2305.11281
Video is a promising source of knowledge for embodied agents to learn models of the world's dynamics. Large deep networks have become increasingly effective at modeling complex video data in a self-supervised manner, as evaluated by metrics based on human perceptual similarity or pixel-wise comparison. However, it remains unclear whether current metrics are accurate indicators of performance on downstream tasks. We find empirically that for planning robotic manipulation, existing metrics can be unreliable at predicting execution success. To address this, we propose a benchmark for action-conditioned video prediction in the form of a control benchmark that evaluates a given model for simulated robotic manipulation through sampling-based planning. Our benchmark, Video Prediction for Visual Planning ($VP^2$), includes simulated environments with 11 task categories and 310 task instance definitions, a full planning implementation, and training datasets containing scripted interaction trajectories for each task category. A central design goal of our benchmark is to expose a simple interface -- a single forward prediction call -- so it is straightforward to evaluate almost any action-conditioned video prediction model. We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling by analyzing five highly-performant video prediction models, finding that while scale can improve perceptual quality when modeling visually diverse settings, other attributes such as uncertainty awareness can also aid planning performance.
https://arxiv.org/abs/2304.13723
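Because the benchmark exposes a single forward-prediction call, a planner only needs to roll the model forward over candidate action sequences and keep the cheapest one. The toy planner below enumerates a small discrete action set rather than sampling; it is our own stand-in for the benchmark's sampling-based planner, with a made-up 1-D point-mass dynamics used only for the example.

```python
from itertools import product

def plan_with_forward_model(predict, cost, state, actions, horizon):
    """Roll out every action sequence with the forward model and return the
    sequence with the lowest accumulated predicted cost."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = predict(s, a)   # the single forward-prediction call
            total += cost(s)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost
```

Swapping exhaustive enumeration for random shooting or CEM over sampled action sequences recovers the sampling-based planning the benchmark actually uses; the model interface stays identical.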
In this paper, we explore the impact of adding tactile sensing to video prediction models for physical robot interactions. Predicting the impact of robotic actions on the environment is a fundamental challenge in robotics. Current methods leverage visual and robot action data to generate video predictions over a given time period, which can then be used to adjust robot actions. However, humans rely on both visual and tactile feedback to develop and maintain a mental model of their physical surroundings. We propose three multi-modal integration approaches and compare the performance of the resulting tactile-enhanced video prediction models. Additionally, we introduce two new datasets of robot pushing that use a magnetic-based tactile sensor for unsupervised learning. The first dataset contains visually identical objects with different physical properties, while the second dataset mimics existing robot-pushing datasets of household object clusters. Our results demonstrate that incorporating tactile feedback into video prediction models improves scene prediction accuracy and enhances the agent's perception of physical interactions and understanding of cause-effect relationships during physical robot interactions.
https://arxiv.org/abs/2304.11193
The drastic variation of motion in spatial and temporal dimensions makes the video prediction task extremely challenging. Existing RNN models obtain higher performance by deepening or widening the model. They obtain the multi-scale features of the video only by stacking layers, which is inefficient and incurs unbearable training costs (memory, FLOPs, and training time). In contrast, this paper proposes a spatiotemporal multi-scale model called MS-LSTM, designed wholly from a multi-scale perspective. On top of stacked layers, MS-LSTM incorporates two additional efficient multi-scale designs to fully capture spatiotemporal context information. Concretely, we employ LSTMs with mirrored pyramid structures to construct spatial multi-scale representations and LSTMs with different convolution kernels to construct temporal multi-scale representations. Detailed comparison experiments with eight baseline models on four video datasets show that MS-LSTM achieves better performance at lower training cost.
https://arxiv.org/abs/2304.07724
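The temporal multi-scale idea, parallel branches with different kernel widths, can be caricatured with plain moving-average filters. This is a deliberate simplification of ours: MS-LSTM applies the differently-sized kernels inside convolutional LSTM cells, not as a standalone filter bank.

```python
import numpy as np

def temporal_multiscale(seq, widths=(1, 3, 5)):
    """One smoothed view of the sequence per kernel width: narrow kernels
    preserve fast dynamics, wide kernels expose slow trends."""
    return np.stack([np.convolve(seq, np.ones(w) / w, mode="same") for w in widths])
```

Stacking the branch outputs gives the model simultaneous access to fast and slow temporal context without deepening the network.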
We present a novel approach for modeling vegetation response to weather in Europe as measured by the Sentinel 2 satellite. Existing satellite imagery forecasting approaches focus on photorealistic quality of the multispectral images, while derived vegetation dynamics have not yet received as much attention. We leverage both spatial and temporal context by extending state-of-the-art video prediction methods with weather guidance. We extend the EarthNet2021 dataset to be suitable for vegetation modeling by introducing a learned cloud mask and an appropriate evaluation scheme. Qualitative and quantitative experiments demonstrate superior performance of our approach over a wide variety of baseline methods, including leading approaches to satellite imagery forecasting. Additionally, we show how our modeled vegetation dynamics can be leveraged in a downstream task: inferring gross primary productivity for carbon monitoring. To the best of our knowledge, this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle, thereby paving the way for predictive assessments of vegetation status.
https://arxiv.org/abs/2303.16198
Imagining the future trajectory is the key for robots to make sound plans and successfully reach their goals. Therefore, text-conditioned video prediction (TVP), i.e., predicting future video frames from a given language instruction and reference frames, is an essential task for facilitating general robot policy learning. Grounding task-level goals specified by instructions together with high-fidelity frames is highly challenging, requiring large-scale data and computation. To tackle this task and empower robots with the ability to foresee the future, we propose a sample- and computation-efficient model, named \textbf{Seer}, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We inflate the denoising U-Net and language conditioning model with two novel techniques, Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer, to propagate the rich prior knowledge in the pretrained T2I models across the frames. With this well-designed architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a few layers on a small amount of data. Experimental results on the Something-Something V2 (SSv2) and Bridgedata datasets demonstrate superior video prediction performance with around 210 hours of training on 4 RTX 3090 GPUs: decreasing the FVD of the current SOTA model from 290 to 200 on SSv2 and achieving at least 70\% preference in the human evaluation.
https://arxiv.org/abs/2303.14897
The performance of video prediction has been greatly boosted by advanced deep neural networks. However, most current methods suffer from large model sizes and require extra inputs, e.g., semantic/depth maps, for promising performance. For efficiency, in this paper we propose a Dynamic Multi-scale Voxel Flow Network (DMVFN) that achieves better video prediction performance than previous methods at lower computational cost, using only RGB images. The core of our DMVFN is a differentiable routing module that can effectively perceive the motion scales of video frames. Once trained, our DMVFN selects adaptive sub-networks for different inputs at the inference stage. Experiments on several benchmarks demonstrate that our DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative-based OPT in generated image quality. Our code and demo are available at this https URL.
https://arxiv.org/abs/2303.09875
Video prediction is a complex time-series forecasting task with great potential in many use cases. However, conventional methods overemphasize accuracy while ignoring the slow prediction speed caused by complicated model structures that learn too much redundant information with excessive GPU memory consumption. Furthermore, conventional methods mostly predict frames sequentially (frame-by-frame) and are thus hard to accelerate. Consequently, valuable use cases such as real-time danger prediction and warning cannot achieve inference speeds fast enough to be applicable in reality. Therefore, we propose a transformer-based keypoint prediction neural network (TKN), an unsupervised learning method that boosts the prediction process via constrained information extraction and a parallel prediction scheme. To the best of our knowledge, TKN is the first real-time video prediction solution; it significantly reduces computation costs while maintaining performance in other respects. Extensive experiments on the KTH and Human3.6M datasets demonstrate that TKN predicts 11 times faster than existing methods while reducing memory consumption by 17.4% and achieving state-of-the-art prediction performance on average.
https://arxiv.org/abs/2303.09807
Future frame prediction has been approached through two primary methods: autoregressive and non-autoregressive. Autoregressive methods rely on the Markov assumption and can achieve high accuracy in the early stages of prediction, when errors have not yet accumulated. However, their performance tends to decline as the number of time steps increases. In contrast, non-autoregressive methods can achieve relatively high performance but lack correlation between the predictions at each time step. In this paper, we propose an Implicit Stacked Autoregressive Model for Video Prediction (IAM4VP), an implicit video prediction model that applies a stacked autoregressive method. Like non-autoregressive methods, stacked autoregressive methods use the same observed frame to estimate all future frames. However, they use their own predictions as input, similar to autoregressive methods. As the number of time steps increases, predictions are sequentially stacked in the queue. To evaluate the effectiveness of IAM4VP, we conducted experiments on three common future frame prediction benchmark datasets and on weather and climate prediction benchmark datasets. The results demonstrate that our proposed model achieves state-of-the-art performance.
https://arxiv.org/abs/2303.07849
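The stacked-autoregressive loop is easy to state in code: every future frame is estimated from the same observed input, but the model also receives the queue of its own earlier predictions. A toy sketch of ours; `toy_model` is a made-up stand-in for the learned predictor, used only to make the loop concrete.

```python
def stacked_autoregressive_predict(model, observed, horizon):
    """Condition every step on the same observed input plus the queue of
    predictions made so far (the 'stacked' part)."""
    queue = []
    for _ in range(horizon):
        queue.append(model(observed, queue))
    return queue

def toy_model(observed, queue):
    # Hypothetical predictor: next value = observation + steps predicted so far + 1.
    return observed + len(queue) + 1
```

Unlike plain frame-by-frame autoregression, the observed frame never leaves the conditioning set, so early errors are not the model's only source of context.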
Learning physical dynamics in a series of non-stationary environments is a challenging but essential task for model-based reinforcement learning (MBRL) with visual inputs. It requires the agent to consistently adapt to novel tasks without forgetting previous knowledge. In this paper, we present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting. The key assumption is that an ideal world model can provide a non-forgetting environment simulator, which enables the agent to optimize the policy in a multi-task learning manner based on the imagined trajectories from the world model. To this end, we first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting, which we call predictive experience replay. Finally, we extend these methods to continual RL and further address the value estimation problems with the exploratory-conservative behavior learning approach. Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks. It is also shown to effectively alleviate the forgetting of spatiotemporal dynamics in video prediction datasets with evolving domains.
https://arxiv.org/abs/2303.06572
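Predictive experience replay can be caricatured as blending real current-task data with samples the world model generates for earlier tasks, so old dynamics keep appearing in training batches. The helper below is a hypothetical sketch; the function names and the simple blending scheme are our own, not the paper's training procedure.

```python
import random

def predictive_replay_batch(current_data, replay_sample, n_current, n_replay, seed=0):
    """Mix real samples from the current task with trajectories 'replayed'
    (generated) by the world model for earlier tasks, mitigating
    catastrophic forgetting of old dynamics."""
    rng = random.Random(seed)
    batch = rng.sample(current_data, n_current)
    batch += [replay_sample() for _ in range(n_replay)]
    rng.shuffle(batch)
    return batch
```

In the actual method the replayed samples come from the mixture-of-Gaussians world model itself rather than a stored buffer, which is what keeps memory cost flat as tasks accumulate.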
Motion, scene and object are three primary visual components of a video. In particular, objects represent the foreground, scenes represent the background, and motion traces their dynamics. Based on this insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO) for video prediction, consisting of MOSO-VQVAE and MOSO-Transformer. In the first stage, MOSO-VQVAE decomposes a previous video clip into the motion, scene and object components, and represents them as distinct groups of discrete tokens. Then, in the second stage, MOSO-Transformer predicts the object and scene tokens of the subsequent video clip based on the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens. Our framework can be easily extended to unconditional video generation and video frame interpolation tasks. Experimental results demonstrate that our method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI and UCF101. In addition, MOSO can produce realistic videos by combining objects and scenes from different videos.
https://arxiv.org/abs/2303.03684