Connected autonomous vehicles (CAVs) promise to enhance safety, efficiency, and sustainability in urban transportation. However, this is contingent upon a CAV correctly predicting the motion of surrounding agents and planning its own motion safely. Doing so is challenging in complex urban environments due to frequent occlusions and interactions among many agents. One solution is to leverage smart infrastructure to augment a CAV's situational awareness; the present work leverages a recently proposed "Self-Supervised Traffic Advisor" (SSTA) framework of smart sensors that teach themselves to generate and broadcast useful video predictions of road users. In this work, SSTA predictions are modified to predict future occupancy instead of raw video, which reduces the data footprint of broadcast predictions. The resulting predictions are used within a planning framework, demonstrating that this design can effectively aid CAV motion planning. A variety of numerical experiments study the key factors that make SSTA outputs useful for practical CAV planning in crowded urban environments.
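To make the occupancy-based interface concrete, below is a minimal sketch (not the paper's actual pipeline) of how a planner might consume broadcast occupancy predictions: threshold per-pixel scores into binary grids, then reject candidate trajectories that enter occupied cells. The grid resolution, cell size, and function names are illustrative assumptions.

```python
# Hypothetical consumption of broadcast occupancy predictions by a CAV planner.
import numpy as np

def frames_to_occupancy(pred_frames, threshold=0.5):
    """pred_frames: (T, H, W) predicted per-pixel occupancy scores in [0, 1]."""
    return pred_frames >= threshold  # (T, H, W) boolean occupancy grids

def trajectory_is_safe(occupancy, trajectory, cell_size=0.2):
    """occupancy: (T, H, W) boolean grids; trajectory: (T, 2) planned (x, y) positions in meters."""
    T, H, W = occupancy.shape
    for t, (x, y) in enumerate(trajectory):
        row, col = int(y / cell_size), int(x / cell_size)
        if 0 <= row < H and 0 <= col < W and occupancy[t, row, col]:
            return False  # planned position collides with predicted occupancy at time t
    return True

# Example: a 10-step prediction with one occupied block, and a straight-line plan through it.
pred = np.zeros((10, 50, 50)); pred[:, 20:25, 20:25] = 1.0
plan = np.stack([np.linspace(0, 9, 10), np.linspace(0, 9, 10)], axis=1)
print(trajectory_is_safe(frames_to_occupancy(pred), plan))  # False: plan crosses occupied cells
```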
https://arxiv.org/abs/2309.07504
A central challenge of video prediction is that the system has to reason about objects' future motions from image frames while simultaneously maintaining the consistency of their appearances across frames. This work introduces an end-to-end trainable two-stream video prediction framework, Motion-Matrix-based Video Prediction (MMVP), to tackle this challenge. Unlike previous methods that usually handle motion prediction and appearance maintenance within the same set of modules, MMVP decouples motion and appearance information by constructing appearance-agnostic motion matrices. The motion matrices represent the temporal similarity of every pair of feature patches in the input frames and are the sole input of the motion prediction module in MMVP. This design improves video prediction in both accuracy and efficiency, and reduces the model size. Results of extensive experiments demonstrate that MMVP outperforms state-of-the-art systems on public datasets by non-negligible margins (about 1 dB in PSNR on UCF Sports) with significantly smaller model sizes (84% of the size or smaller). Please refer to this https URL for the official code and the datasets used in this paper.
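As a rough illustration of the motion-matrix idea, assuming patch features have already been extracted by an encoder, the temporal similarity of every pair of feature patches in two frames can be written as a row-normalized cosine-similarity matrix. The names and shapes below are illustrative, not the official MMVP code.

```python
import torch
import torch.nn.functional as F

def motion_matrix(feat_t, feat_t1):
    """feat_t, feat_t1: (N, C) features for N patches of frame t and frame t+1.
    Returns an (N, N) matrix of temporal similarities, softmax-normalised per row."""
    sim = F.normalize(feat_t, dim=-1) @ F.normalize(feat_t1, dim=-1).T  # cosine similarity
    return sim.softmax(dim=-1)  # row i: where patch i of frame t "moves to" in frame t+1

feat_t, feat_t1 = torch.randn(64, 128), torch.randn(64, 128)
M = motion_matrix(feat_t, feat_t1)   # (64, 64); appearance is discarded, only similarity remains
print(M.shape, M.sum(dim=-1)[:3])    # rows sum to 1
```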
https://arxiv.org/abs/2308.16154
Video anomaly detection (VAD) is an important but challenging task in computer vision. The main challenge arises from the rarity of training samples needed to model all anomaly cases. Hence, semi-supervised anomaly detection methods have received more attention, since they focus on modeling normal patterns and detect anomalies by measuring deviations from those patterns. Despite impressive advances of these methods in modeling normal motion and appearance, long-term motion modeling has not been effectively explored so far. Inspired by the abilities of the future frame prediction proxy-task, we introduce the task of future video prediction from a single frame as a novel proxy-task for video anomaly detection. This proxy-task alleviates the challenges of previous methods in learning longer motion patterns. Moreover, we replace the initial and future raw frames with their corresponding semantic segmentation maps, which not only makes the method aware of object class but also makes the prediction task less complex for the model. Extensive experiments on the benchmark datasets (ShanghaiTech, UCSD-Ped1, and UCSD-Ped2) show the effectiveness of the method and the superiority of its performance compared to SOTA prediction-based VAD methods.
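For context, prediction-based VAD methods of this family commonly score anomalies by how poorly the predictor matches the observed frame, e.g. via PSNR. The sketch below is a generic scoring routine under that assumption, not this paper's exact formulation.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))

def anomaly_scores(pred_frames, target_frames):
    """Normalise per-frame PSNR to [0, 1]; lower PSNR => higher anomaly score."""
    p = np.array([psnr(p_, t_) for p_, t_ in zip(pred_frames, target_frames)])
    return 1.0 - (p - p.min()) / (p.max() - p.min() + 1e-12)

preds = [np.random.rand(64, 64) for _ in range(8)]
targets = [f + 0.05 * np.random.rand(64, 64) for f in preds]
targets[3] = np.random.rand(64, 64)             # an "anomalous" frame the model failed to predict
print(anomaly_scores(preds, targets).round(2))  # frame 3 stands out
```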
https://arxiv.org/abs/2308.07783
We address the video prediction task by putting forth a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capability of HR-VQVAE to model still images with a parsimonious representation, combined with ST-PixelCNN's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with the chief challenges in video prediction. These include learning spatiotemporal information, handling high-dimensional data, combating blurry predictions, and implicitly modeling physical characteristics. Extensive experimental results on the KTH Human Action and Moving-MNIST tasks demonstrate that our model compares favorably against top video prediction techniques in both quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and ST-PixelCNN parameters.
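The vector-quantization step shared by all VQ-VAE variants is worth spelling out; a minimal sketch of it (nearest-codebook lookup with a straight-through gradient) is shown below. It is illustrative only, not the HR-VQVAE codebase.

```python
import torch

def vector_quantize(z, codebook):
    """z: (B, D) encoder outputs; codebook: (K, D) learned embeddings."""
    dists = torch.cdist(z, codebook)          # (B, K) pairwise L2 distances
    idx = dists.argmin(dim=-1)                # nearest codebook entry per vector
    z_q = codebook[idx]                       # quantised vectors
    return z + (z_q - z).detach(), idx        # straight-through estimator

codebook = torch.randn(512, 64)
z = torch.randn(8, 64, requires_grad=True)
z_q, idx = vector_quantize(z, codebook)
print(z_q.shape, idx.shape)                   # torch.Size([8, 64]) torch.Size([8])
```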
https://arxiv.org/abs/2307.06701
With the increasing adoption of robots across industries, it is crucial to focus on developing advanced algorithms that enable robots to anticipate, comprehend, and plan their actions effectively in collaboration with humans. We introduce the Robot Autonomous Motion (RoAM) video dataset, which is collected with a custom-made turtlebot3 Burger robot in a variety of indoor environments, recording various human motions from the robot's ego-vision. The dataset also includes synchronized records of the LiDAR scan and all control actions taken by the robot as it navigates around static and moving human agents. This unique dataset provides an opportunity to develop and benchmark new visual prediction frameworks that can predict future image frames based on the action taken by the recording agent, in partially observable scenarios or cases where the imaging sensor is mounted on a moving platform. We have benchmarked the dataset on our novel deep visual prediction framework, ACPNet, in which the approximated future image frames are also conditioned on the action taken by the robot, and demonstrated its potential for incorporating robot dynamics into the video prediction paradigm for mobile robotics and autonomous navigation research.
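A minimal sketch of action-conditioned frame prediction in the spirit described above: the encoded image is combined with the robot's control action before decoding the next frame. The architecture, action dimensionality, and names are assumptions for illustration, not ACPNet itself.

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    def __init__(self, action_dim=2, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
                                     nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU())
        self.action_proj = nn.Linear(action_dim, hidden)
        self.decoder = nn.Sequential(nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
                                     nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, frame, action):
        """frame: (B, 3, H, W); action: (B, action_dim), e.g. linear and angular velocity."""
        h = self.encoder(frame)
        a = self.action_proj(action)[:, :, None, None]   # broadcast action over spatial dims
        return self.decoder(h + a)                        # predicted next frame

model = ActionConditionedPredictor()
next_frame = model(torch.rand(1, 3, 64, 64), torch.tensor([[0.2, 0.0]]))
print(next_frame.shape)  # torch.Size([1, 3, 64, 64])
```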
https://arxiv.org/abs/2306.15852
General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recent years, benchmarks to test their performance typically do not require an understanding that objects have individual physical properties, or at best test only those properties that are directly observable (e.g., size or color). This work proposes a novel dataset and benchmark, termed Physion++, that rigorously evaluates visual physical prediction in artificial systems under circumstances where those predictions rely on accurate estimates of the latent physical properties of objects in the scene. Specifically, we test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability, and where the values of those properties can only be inferred by observing how objects move and interact with other objects or fluids. We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions. We find that models that have been trained using standard regimes and datasets do not spontaneously learn to make inferences about latent properties, but also that models that encode objectness and physical states tend to make better predictions. However, there is still a huge gap between all models and human performance, and all models' predictions correlate poorly with those made by humans, suggesting that no state-of-the-art model is learning to make physical predictions in a human-like way. Project page: this https URL
https://arxiv.org/abs/2306.15668
In recent years, deep learning-based solar forecasting using all-sky images has emerged as a promising approach for alleviating uncertainty in PV power generation. However, the stochastic nature of cloud movement remains a major challenge for accurate and reliable solar forecasting. With the recent advances in generative artificial intelligence, the synthesis of visually plausible yet diversified sky videos has potential for aiding in forecasts. In this study, we introduce \emph{SkyGPT}, a physics-informed stochastic video prediction model that is able to generate multiple possible future images of the sky with diverse cloud motion patterns, by using past sky image sequences as input. Extensive experiments and comparison with benchmark video prediction models demonstrate the effectiveness of the proposed model in capturing cloud dynamics and generating future sky images with high realism and diversity. Furthermore, we feed the generated future sky images from the video prediction models into 15-minute-ahead probabilistic solar forecasting for a 30-kW roof-top PV system, and compare it with an end-to-end deep learning baseline model SUNSET and a smart persistence model. Better PV output prediction reliability and sharpness are observed when using the predicted sky images generated with SkyGPT compared with other benchmark models, achieving a continuous ranked probability score (CRPS) of 2.81 (13\% better than SUNSET and 23\% better than smart persistence) and a Winkler score of 26.70 for the test set. Although an arbitrary number of futures can be generated from a historical sky image sequence, the results suggest that 10 future scenarios is a good choice that balances probabilistic solar forecasting performance and computational cost.
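For reference, the CRPS reported above can be estimated directly from an ensemble of sampled futures; a minimal sketch of the standard ensemble estimator is shown below, with made-up PV power values rather than the paper's data.

```python
import numpy as np

def crps_ensemble(samples, observation):
    """samples: (M,) ensemble of predicted PV power values; observation: scalar truth.
    CRPS = E|X - y| - 0.5 * E|X - X'| over ensemble members X, X'."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - observation))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

forecast_kw = np.array([18.2, 20.5, 19.8, 22.1, 17.9, 21.0, 19.3, 20.2, 18.8, 21.5])  # 10 sampled futures
print(round(crps_ensemble(forecast_kw, observation=20.0), 3))
```

Lower CRPS is better; a sharp, well-calibrated ensemble concentrated near the observation drives both terms down.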
https://arxiv.org/abs/2306.11682
Video prediction is a pixel-level task that generates future frames by employing the historical frames. Continuous complex motions, such as object overlapping and scene occlusion, often exist in video and pose great challenges to this task. Previous works either fail to capture the long-term temporal dynamics well or do not handle the occlusion masks. To address these issues, we develop the fully convolutional Fast Fourier Inception Networks for video prediction, termed \textit{FFINet}, which includes two primary components, \ie, the occlusion inpainter and the spatiotemporal translator. The former adopts fast Fourier convolutions to enlarge the receptive field, such that the missing areas (occlusions) with complex geometric structures are filled by the inpainter. The latter employs the stacked Fourier transform inception module to learn the temporal evolution by group convolutions and the spatial movement by channel-wise Fourier convolutions, which captures both the local and the global spatiotemporal features. This encourages generating more realistic and high-quality future frames. To optimize the model, a recovery loss is imposed on the objective, \ie, minimizing the mean square error between the ground-truth frame and the recovered frame. Both quantitative and qualitative experimental results on five benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, have demonstrated the superiority of the proposed approach. Our code is available at GitHub.
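A minimal sketch of a Fourier (spectral) convolution of the kind fast-Fourier-convolution blocks build on: transform features to the frequency domain, mix channels there with a 1x1 convolution, and transform back, giving every output location a global receptive field. This is illustrative only, not the official FFINet code.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # real and imaginary parts are stacked along the channel axis before mixing
        self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        f = torch.fft.rfft2(x, norm="ortho")    # (B, C, H, W//2+1), complex
        f = torch.cat([f.real, f.imag], dim=1)  # (B, 2C, H, W//2+1)
        f = self.mix(f)
        real, imag = f.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(H, W), norm="ortho")

y = SpectralConv2d(16)(torch.randn(2, 16, 32, 32))
print(y.shape)  # torch.Size([2, 16, 32, 32])
```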
https://arxiv.org/abs/2306.10346
We propose a new object-centric video prediction algorithm based on the deep latent particle (DLP) representation. In comparison to existing slot- or patch-based representations, DLPs model the scene using a set of keypoints with learned parameters for properties such as position and size, and are both efficient and interpretable. Our method, deep dynamic latent particles (DDLP), yields state-of-the-art object-centric video prediction results on several challenging datasets. The interpretable nature of DDLP allows us to perform ``what-if'' generation -- predict the consequence of changing properties of objects in the initial frames, and DLP's compact structure enables efficient diffusion-based unconditional video generation. Videos, code and pre-trained models are available: this https URL
https://arxiv.org/abs/2306.05957
Diffusion models have emerged as a powerful paradigm in video synthesis tasks including prediction, generation, and interpolation. Due to the limitation of the computational budget, existing methods usually implement conditional diffusion models with an autoregressive inference pipeline, in which the future fragment is predicted based on the distribution of adjacent past frames. However, conditions drawn from only a few previous frames cannot capture the global temporal coherence, leading to inconsistent or even implausible results in long-term video prediction. In this paper, we propose a Local-Global Context guided Video Diffusion model (LGC-VD) to capture multi-perception conditions for producing high-quality videos in both conditional/unconditional settings. In LGC-VD, the UNet is implemented with stacked residual blocks with self-attention units, avoiding the undesirable computational cost of 3D convolutions. We construct a local-global context guidance strategy to capture the multi-perceptual embedding of the past fragment to boost the consistency of future prediction. Furthermore, we propose a two-stage training strategy to alleviate the effect of noisy frames for more stable predictions. Our experiments demonstrate that the proposed method achieves favorable performance on video prediction, interpolation, and unconditional video generation. We release code at this https URL.
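For clarity, the autoregressive inference pipeline discussed above can be summarized as the loop below: each future fragment is sampled conditioned only on the most recent frames and appended to the sequence. The `sample_fragment` function is a placeholder for any conditional sampler, not the LGC-VD API.

```python
import torch

def sample_fragment(condition_frames, fragment_len=4):
    # stand-in for a conditional diffusion sampler; returns random frames here
    B, _, C, H, W = condition_frames.shape
    return torch.rand(B, fragment_len, C, H, W)

def autoregressive_rollout(context, num_fragments=3, window=4):
    frames = context                                       # (B, T0, C, H, W)
    for _ in range(num_fragments):
        fragment = sample_fragment(frames[:, -window:])    # condition on the last few frames only
        frames = torch.cat([frames, fragment], dim=1)
    return frames

video = autoregressive_rollout(torch.rand(1, 4, 3, 64, 64))
print(video.shape)  # torch.Size([1, 16, 3, 64, 64])
```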
https://arxiv.org/abs/2306.02562
Specifying reward signals that allow agents to learn complex behaviors is a long-standing challenge in reinforcement learning. A promising approach is to extract preferences for behaviors from unlabeled videos, which are widely available on the internet. We present Video Prediction Rewards (VIPER), an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning. Specifically, we first train an autoregressive transformer on expert videos and then use the video prediction likelihoods as reward signals for a reinforcement learning agent. VIPER enables expert-level control without programmatic task rewards across a wide range of DMC, Atari, and RLBench tasks. Moreover, generalization of the video prediction model allows us to derive rewards for an out-of-distribution environment where no expert data is available, enabling cross-embodiment generalization for tabletop manipulation. We see our work as a starting point for scalable reward specification from unlabeled videos that will benefit from the rapid advances in generative modeling. Source code and datasets are available on the project website: this https URL
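A minimal sketch of the reward construction described in the abstract: the frozen video model's log-likelihood of each observed frame given its history serves as a per-timestep reward. The `log_prob` interface and the dummy model are assumptions so the sketch runs end to end, not the released VIPER code.

```python
import torch

def viper_reward(video_model, frames):
    """frames: (T, C, H, W) observed rollout. Returns one reward per timestep:
    r_t = log p(x_t | x_{<t}) under the frozen video prediction model."""
    rewards = []
    with torch.no_grad():
        for t in range(1, frames.shape[0]):
            log_prob = video_model.log_prob(context=frames[:t], target=frames[t])  # assumed interface
            rewards.append(log_prob.item())
    return rewards

class DummyVideoModel:  # stand-in so the sketch is self-contained
    def log_prob(self, context, target):
        return -((target - context[-1]) ** 2).mean()  # favours temporally smooth rollouts

print(viper_reward(DummyVideoModel(), torch.rand(5, 3, 32, 32)))
```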
https://arxiv.org/abs/2305.14343
In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. Existing approaches typically utilize external information such as semantic maps to enhance video prediction but often neglect the inherent physical knowledge embedded within videos. Furthermore, their high computational demands could impede their application to high-resolution videos. To address these constraints, we introduce a novel approach called Physics-assisted Spatio-temporal Network (PastNet) for generating high-quality video predictions. The core of our PastNet lies in incorporating a spectral convolution operator in the Fourier domain, which efficiently introduces inductive biases from the underlying physical laws. Additionally, we employ a memory bank with the estimated intrinsic dimensionality to discretize local features during the processing of complex spatio-temporal signals, thereby reducing computational costs and facilitating efficient high-resolution video prediction. Extensive experiments on various widely-used datasets demonstrate the effectiveness and efficiency of the proposed PastNet compared with state-of-the-art methods, particularly in high-resolution scenarios.
https://arxiv.org/abs/2305.11421
Object-centric learning aims to represent visual data with a set of object entities (a.k.a. slots), providing structured representations that enable systematic generalization. Leveraging advanced architectures like Transformers, recent approaches have made significant progress in unsupervised object discovery. In addition, slot-based representations hold great potential for generative modeling, such as controllable image generation and object manipulation in image editing. However, current slot-based methods often produce blurry images and distorted objects, exhibiting poor generative modeling capabilities. In this paper, we focus on improving slot-to-image decoding, a crucial aspect for high-quality visual generation. We introduce SlotDiffusion -- an object-centric Latent Diffusion Model (LDM) designed for both image and video data. Thanks to the powerful modeling capacity of LDMs, SlotDiffusion surpasses previous slot models in unsupervised object segmentation and visual generation across six datasets. Furthermore, our learned object features can be utilized by existing object-centric dynamics models, improving video prediction quality and downstream temporal reasoning tasks. Finally, we demonstrate the scalability of SlotDiffusion to unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated with self-supervised pre-trained image encoders.
https://arxiv.org/abs/2305.11281
Video is a promising source of knowledge for embodied agents to learn models of the world's dynamics. Large deep networks have become increasingly effective at modeling complex video data in a self-supervised manner, as evaluated by metrics based on human perceptual similarity or pixel-wise comparison. However, it remains unclear whether current metrics are accurate indicators of performance on downstream tasks. We find empirically that for planning robotic manipulation, existing metrics can be unreliable at predicting execution success. To address this, we propose a benchmark for action-conditioned video prediction in the form of a control benchmark that evaluates a given model for simulated robotic manipulation through sampling-based planning. Our benchmark, Video Prediction for Visual Planning ($VP^2$), includes simulated environments with 11 task categories and 310 task instance definitions, a full planning implementation, and training datasets containing scripted interaction trajectories for each task category. A central design goal of our benchmark is to expose a simple interface -- a single forward prediction call -- so it is straightforward to evaluate almost any action-conditioned video prediction model. We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling by analyzing five highly-performant video prediction models, finding that while scale can improve perceptual quality when modeling visually diverse settings, other attributes such as uncertainty awareness can also aid planning performance.
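A minimal sketch of the single-call interface the benchmark is built around, together with a random-shooting planner that uses it; the function names and shapes are illustrative, not the actual VP^2 API.

```python
import numpy as np

def predict(model, context_frames, actions):
    """One forward prediction call: (T_ctx, H, W, C) context plus (N, T, A) action
    sequences -> (N, T, H, W, C) predicted futures."""
    return model(context_frames, actions)

def plan_random_shooting(model, context_frames, goal_image, horizon=10, n_samples=64, action_dim=4):
    actions = np.random.uniform(-1, 1, size=(n_samples, horizon, action_dim))
    futures = predict(model, context_frames, actions)                    # (N, T, H, W, C)
    costs = ((futures[:, -1] - goal_image) ** 2).mean(axis=(1, 2, 3))    # distance of final frame to goal
    return actions[np.argmin(costs)]                                     # best action sequence

dummy_model = lambda ctx, acts: np.random.rand(acts.shape[0], acts.shape[1], 48, 48, 3)
best = plan_random_shooting(dummy_model, np.random.rand(2, 48, 48, 3), np.random.rand(48, 48, 3))
print(best.shape)  # (10, 4)
```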
https://arxiv.org/abs/2304.13723
In this paper, we explore the impact of adding tactile sensation to video prediction models for physical robot interactions. Predicting the impact of robotic actions on the environment is a fundamental challenge in robotics. Current methods leverage visual and robot action data to generate video predictions over a given time period, which can then be used to adjust robot actions. However, humans rely on both visual and tactile feedback to develop and maintain a mental model of their physical surroundings. Motivated by this, we investigate how integrating tactile feedback into video prediction models affects physical robot interactions. We propose three multi-modal integration approaches and compare the performance of these tactile-enhanced video prediction models. Additionally, we introduce two new datasets of robot pushing that use a magnetic-based tactile sensor for unsupervised learning. The first dataset contains visually identical objects with different physical properties, while the second dataset mimics existing robot-pushing datasets of household object clusters. Our results demonstrate that incorporating tactile feedback into video prediction models improves scene prediction accuracy and enhances the agent's perception of physical interactions and understanding of cause-effect relationships during physical robot interactions.
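One plausible integration strategy, early fusion by feature concatenation, is sketched below for illustration; the paper compares three approaches, and this stand-in is not their implementation.

```python
import torch
import torch.nn as nn

class EarlyFusionPredictor(nn.Module):
    def __init__(self, tactile_dim=48, hidden=64):
        super().__init__()
        self.visual_enc = nn.Sequential(nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.tactile_enc = nn.Sequential(nn.Linear(tactile_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, hidden)     # fused representation for a video decoder

    def forward(self, frame, tactile):
        fused = torch.cat([self.visual_enc(frame), self.tactile_enc(tactile)], dim=-1)
        return self.head(fused)

z = EarlyFusionPredictor()(torch.rand(2, 3, 64, 64), torch.rand(2, 48))
print(z.shape)  # torch.Size([2, 64])
```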
https://arxiv.org/abs/2304.11193
The drastic variation of motion in spatial and temporal dimensions makes the video prediction task extremely challenging. Existing RNN models obtain higher performance by deepening or widening the model. They obtain the multi-scale features of the video only by stacking layers, which is inefficient and brings prohibitive training costs in memory, FLOPs, and training time. In contrast, this paper proposes a spatiotemporal multi-scale model called MS-LSTM, designed entirely from a multi-scale perspective. On the basis of stacked layers, MS-LSTM incorporates two additional efficient multi-scale designs to fully capture spatiotemporal context information. Concretely, we employ LSTMs with mirrored pyramid structures to construct spatial multi-scale representations and LSTMs with different convolution kernels to construct temporal multi-scale representations. Detailed comparison experiments with eight baseline models on four video datasets show that MS-LSTM achieves better performance at lower training costs.
https://arxiv.org/abs/2304.07724
We present a novel approach for modeling vegetation response to weather in Europe as measured by the Sentinel 2 satellite. Existing satellite imagery forecasting approaches focus on photorealistic quality of the multispectral images, while derived vegetation dynamics have not yet received as much attention. We leverage both spatial and temporal context by extending state-of-the-art video prediction methods with weather guidance. We extend the EarthNet2021 dataset to be suitable for vegetation modeling by introducing a learned cloud mask and an appropriate evaluation scheme. Qualitative and quantitative experiments demonstrate superior performance of our approach over a wide variety of baseline methods, including leading approaches to satellite imagery forecasting. Additionally, we show how our modeled vegetation dynamics can be leveraged in a downstream task: inferring gross primary productivity for carbon monitoring. To the best of our knowledge, this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle, thereby paving the way for predictive assessments of vegetation status.
https://arxiv.org/abs/2303.16198
Imagining the future trajectory is the key for robots to make sound plans and successfully reach their goals. Therefore, text-conditioned video prediction (TVP), i.e., predicting future video frames given a language instruction and reference frames, is an essential task for facilitating general robot policy learning. Grounding task-level goals specified by instructions in high-fidelity future frames is highly challenging and requires large-scale data and computation. To tackle this task and empower robots with the ability to foresee the future, we propose a sample- and computation-efficient model, named \textbf{Seer}, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We inflate the denoising U-Net and language conditioning model with two novel techniques, Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer, to propagate the rich prior knowledge in the pretrained T2I models across the frames. With the well-designed architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a few layers on a small amount of data. The experimental results on the Something-Something V2 (SSv2) and Bridgedata datasets demonstrate our superior video prediction performance with around 210 hours of training on 4 RTX 3090 GPUs: decreasing the FVD of the current SOTA model from 290 to 200 on SSv2 and achieving at least 70\% preference in the human evaluation.
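A minimal sketch of what "inflating" an image model along the temporal axis can look like: a temporal self-attention layer is inserted that attends over frames at each spatial location, while pretrained spatial layers are left untouched. This is an illustrative stand-in, not the Seer implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                     # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)  # one token sequence per pixel
        out, _ = self.attn(seq, seq, seq)                        # attend only across time
        return out.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)

y = TemporalAttention()(torch.randn(1, 4, 32, 8, 8))
print(y.shape)  # torch.Size([1, 4, 32, 8, 8])
```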
https://arxiv.org/abs/2303.14897
The performance of video prediction has been greatly boosted by advanced deep neural networks. However, most current methods suffer from large model sizes and require extra inputs, e.g., semantic/depth maps, for promising performance. For efficiency, in this paper we propose a Dynamic Multi-scale Voxel Flow Network (DMVFN) that achieves better video prediction performance than previous methods at lower computational cost, using only RGB images. The core of our DMVFN is a differentiable routing module that can effectively perceive the motion scales of video frames. Once trained, our DMVFN selects adaptive sub-networks for different inputs at the inference stage. Experiments on several benchmarks demonstrate that our DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative-based OPT in generated image quality. Our code and demo are available at this https URL.
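A minimal sketch of a differentiable routing gate in the spirit of the abstract: a small network scores several sub-networks and the soft scores weight their outputs, so routing can be trained end to end. The branch choice and shapes are illustrative, not the DMVFN code.

```python
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    def __init__(self, channels=16, num_branches=3):
        super().__init__()
        # branches with increasing receptive field, standing in for multi-scale flow blocks
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (3, 5, 7))
        self.router = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(channels, num_branches))

    def forward(self, x):
        weights = self.router(x).softmax(dim=-1)                   # (B, num_branches)
        outs = torch.stack([b(x) for b in self.branches], dim=1)   # (B, num_branches, C, H, W)
        return (weights[:, :, None, None, None] * outs).sum(dim=1)

y = RoutedBlock()(torch.randn(2, 16, 32, 32))
print(y.shape)  # torch.Size([2, 16, 32, 32])
```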
https://arxiv.org/abs/2303.09875
Video prediction is a complex time-series forecasting task with great potential in many use cases. However, conventional methods overemphasize accuracy while ignoring the slow prediction speed caused by complicated model structures that learn too much redundant information with excessive GPU memory consumption. Furthermore, conventional methods mostly predict frames sequentially (frame-by-frame) and are thus hard to accelerate. Consequently, valuable use cases such as real-time danger prediction and warning cannot achieve fast enough inference speed to be applicable in reality. Therefore, we propose a transformer-based keypoint prediction neural network (TKN), an unsupervised learning method that boosts the prediction process via constrained information extraction and a parallel prediction scheme. To the best of our knowledge, TKN is the first real-time video prediction solution, significantly reducing computational cost while maintaining strong performance on other metrics. Extensive experiments on the KTH and Human3.6 datasets demonstrate that TKN predicts 11 times faster than existing methods while reducing memory consumption by 17.4% and achieving state-of-the-art prediction performance on average.
https://arxiv.org/abs/2303.09807