In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model, we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits from recent latent diffusion models, our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We showcase that it can generalize to various unseen driving datasets in a zero-shot manner, surpassing general or driving-specific video prediction counterparts. Furthermore, GenAD can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.
https://arxiv.org/abs/2403.09630
We introduce a new approach that uses computer vision to predict land surface displacement from subsurface geometry images for Carbon Capture and Sequestration (CCS). CCS has been proven to be a key component of a carbon-neutral society. However, scientists see challenges along the way, including the high computational cost due to the large model scale and the limited ability of pre-trained models to generalize across complex physics. We tackle those challenges by training models directly from the subsurface geometry images. The goal is to understand the response of land surface displacement to carbon injection and to utilize our trained models to inform decision making in CCS projects. We implement multiple models (CNN, ResNet, and ResNetUNet) for the static mechanics problem, which is an image prediction problem. Next, we use an LSTM and a transformer for the transient mechanics scenario, which is a video prediction problem. ResNetUNet outperforms the other models in the static mechanics problem thanks to its architecture, and the LSTM shows performance comparable to the transformer in the transient problem. This report proceeds by outlining our dataset in detail, followed by model descriptions in the method section. The results and discussion state the key learnings and observations, and a conclusion with future work rounds out the paper.
https://arxiv.org/abs/2403.06025
The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. Toward this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real time for surgical applications, toward which we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, displaying improved performance over state-of-the-art single frame predictors.
https://arxiv.org/abs/2403.05949
Exponential Moving Average (EMA) is a widely used weight averaging (WA) regularization to learn flat optima for better generalization without extra cost in deep neural network (DNN) optimization. Despite achieving better flatness, existing WA methods might fall into worse final performances or require extra test-time computations. This work unveils the full potential of EMA with a single line of modification, i.e., switching the EMA parameters to the original model after each epoch, dubbed Switch EMA (SEMA). From both theoretical and empirical aspects, we demonstrate that SEMA can help DNNs reach generalization optima that strike a better trade-off between flatness and sharpness. To verify the effectiveness of SEMA, we conduct comparison experiments with discriminative, generative, and regression tasks on vision and language datasets, including image classification, self-supervised learning, object detection and segmentation, image generation, video prediction, attribute regression, and language modeling. Comprehensive results with popular optimizers and networks show that SEMA is a free lunch for DNN training, improving performance and boosting convergence speed.
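Since the modification is literally a single weight copy, a minimal sketch is easy to give. The following hypothetical PyTorch snippet illustrates our reading of SEMA (the toy model, training loop, and decay value are illustrative, not the paper's code): a standard EMA update after every optimizer step, then the EMA parameters are switched into the trained model at each epoch boundary.

```python
import copy

import torch
import torch.nn.functional as F

def ema_update(ema_model, model, decay=0.999):
    # Standard EMA: ema <- decay * ema + (1 - decay) * model
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

# Toy setup (illustrative only).
model = torch.nn.Linear(8, 2)
ema_model = copy.deepcopy(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(3):
    for _ in range(10):  # stand-in for mini-batches
        x, y = torch.randn(4, 8), torch.randn(4, 2)
        loss = F.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ema_update(ema_model, model)
    # SEMA: the single-line modification -- switch the EMA parameters
    # into the original model after each epoch.
    model.load_state_dict(ema_model.state_dict())
```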
https://arxiv.org/abs/2402.09240
Video prediction, a fundamental task in computer vision, aims to enable models to generate sequences of future frames based on existing video content. This task has found widespread application across various domains. In this paper, we comprehensively survey both historical and contemporary works in this field, encompassing the most widely used datasets and algorithms. Our survey scrutinizes the challenges and evolving landscape of video prediction within the realm of computer vision. We propose a novel taxonomy centered on the stochastic nature of video prediction algorithms. This taxonomy accentuates the gradual transition from deterministic to generative prediction methodologies, underlining significant advancements and shifts in approach.
https://arxiv.org/abs/2401.14718
Despite recent advances in video action recognition achieving strong performance on existing benchmarks, these models often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such distribution disparity. One method uses two different datasets collected from different sources, using one for training and validation and the other for testing. More precisely, we created dataset splits of HMDB-51 or UCF-101 for training, and Kinetics-400 for testing, using the subset of the classes that overlap in both train and test datasets. The other proposed method extracts the feature mean of each class from the target evaluation dataset's training data (i.e., the class prototype) and estimates the test video prediction as a cosine similarity score between each sample and the class prototype of each target class. This procedure does not alter model weights using the target dataset and does not require aligning overlapping classes of two different datasets, making it a very efficient method for testing model robustness to distribution shifts without prior knowledge of the target distribution. We address the robustness problem by adversarial augmentation training - generating augmented views of videos that are "hard" for the classification model by applying gradient ascent on the augmentation parameters - as well as by "curriculum" scheduling of the strength of the video augmentations. We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models - TSM, Video Swin Transformer, and Uniformer. The presented work provides critical insight into model robustness to distribution shifts and presents effective techniques to enhance video action recognition performance in real-world deployment.
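The prototype-based protocol is concrete enough to sketch. Below is a minimal illustration of the procedure as described (the feature dimension, class count, and function names are hypothetical): per-class feature means from the target training split serve as prototypes, and each test video is assigned the class of its most cosine-similar prototype, with no weight updates.

```python
import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    # Mean feature vector per class from the target dataset's training data.
    protos = torch.stack([features[labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=1)

def prototype_predict(test_features, protos):
    # Cosine similarity between each test sample and every class prototype;
    # the most similar prototype gives the predicted class.
    sims = F.normalize(test_features, dim=1) @ protos.T
    return sims.argmax(dim=1)

# Toy example: 128-d video features, 5 overlapping classes.
train_feats = torch.randn(100, 128)
train_labels = torch.arange(100) % 5  # ensures every class is present
protos = class_prototypes(train_feats, train_labels, num_classes=5)
preds = prototype_predict(torch.randn(20, 128), protos)
```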
https://arxiv.org/abs/2401.11406
In this paper, we introduce a Key-point-guided Diffusion probabilistic Model (KDM) that gains precise control over images by manipulating the object's key-points. We propose a two-stage generative model incorporating an optical flow map as an intermediate output. By doing so, a dense pixel-wise understanding of the semantic relation between the image and the sparse key points is established, leading to more realistic image generation. Additionally, the integration of optical flow helps regulate the inter-frame variance of sequential images, enabling authentic sequential image generation. The KDM is evaluated on diverse key-point-conditioned image synthesis tasks, including facial image generation, human pose synthesis, and echocardiography video prediction, demonstrating that the KDM produces consistency-enhanced and photo-realistic images compared with state-of-the-art models.
https://arxiv.org/abs/2401.08178
Predicting future frames of a video is challenging because it is difficult to learn the uncertainty of the underlying factors influencing their contents. In this paper, we propose a novel video prediction model, which has infinite-dimensional latent variables over the spatio-temporal domain. Specifically, we first decompose the video motion and content information, then take a neural stochastic differential equation to predict the temporal motion information, and finally, an image diffusion model autoregressively generates the video frame by conditioning on the predicted motion feature and the previous frame. The better expressiveness and stronger stochasticity learning capability of our model lead to state-of-the-art video prediction performance. Moreover, our model is able to achieve temporally continuous prediction, i.e., predicting, in an unsupervised way, future video frames at an arbitrarily high frame rate. Our code is available at this https URL.
https://arxiv.org/abs/2312.06486
Long-term urban mobility predictions play a crucial role in the effective management of urban facilities and services. Conventionally, urban mobility data has been structured as spatiotemporal videos, treating longitude and latitude grids as fundamental pixels. Consequently, video prediction methods, relying on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have been instrumental in this domain. In our research, we introduce a fresh perspective on urban mobility prediction. Instead of oversimplifying urban mobility data as traditional video data, we regard it as a complex multivariate time series. This perspective involves treating the time-varying values of each grid in each channel as individual time series, necessitating a thorough examination of temporal dynamics, cross-variable correlations, and frequency-domain insights for precise and reliable predictions. To address this challenge, we present the Super-Multivariate Urban Mobility Transformer (SUMformer), which utilizes a specially designed attention mechanism to calculate temporal and cross-variable correlations and reduce computational costs stemming from a large number of time series. SUMformer also employs low-frequency filters to extract essential information for long-term predictions. Furthermore, SUMformer is structured with a temporal patch merge mechanism, forming a hierarchical framework that enables the capture of multi-scale correlations. Consequently, it excels in urban mobility pattern modeling and long-term prediction, outperforming current state-of-the-art methods across three real-world datasets.
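The change of perspective is easy to make concrete. A small sketch (shapes are illustrative, not from the paper) contrasts the conventional video view with the super-multivariate view, in which every (channel, grid-cell) pair becomes one univariate time series:

```python
import torch

# Toy mobility "video": T time steps, C channels (e.g. inflow/outflow),
# over an H x W grid of longitude/latitude cells.
T, C, H, W = 48, 2, 16, 16
video = torch.randn(T, C, H, W)

# Conventional view: a sequence of T frames for CNN/ViT-style predictors.
frames = video  # (T, C, H, W)

# Super-multivariate view: C*H*W univariate time series of length T.
series = video.permute(1, 2, 3, 0).reshape(C * H * W, T)
print(series.shape)  # torch.Size([512, 48])
```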
https://arxiv.org/abs/2312.01699
Video prediction aims to predict future frames from a video's previous content. Existing methods mainly process video data where the time dimension mingles with the space and channel dimensions from three distinct angles: as a sequence of individual frames, as a 3D volume in spatiotemporal coordinates, or as a stacked image where frames are treated as separate channels. Most of them generally focus on one of these perspectives and may fail to fully exploit the relationships across different dimensions. To address this issue, this paper introduces a convolutional mixer for video prediction, termed ViP-Mixer, to model the spatiotemporal evolution in the latent space of an autoencoder. The ViP-Mixers are stacked sequentially and interleave feature mixing at three levels: frames, channels, and locations. Extensive experiments demonstrate that our proposed method achieves new state-of-the-art prediction performance on three benchmark video datasets covering both synthetic and real-world scenarios.
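As a rough illustration of mixing at the three levels named above (a toy block under our own design choices, not the paper's architecture), the sketch below interleaves a linear mix across frames, a linear mix across channels, and a depthwise convolution across spatial locations, each with a residual connection:

```python
import torch
import torch.nn as nn

class ToyMixerBlock(nn.Module):
    """Hypothetical three-level mixer on an autoencoder latent (B, T, C, H, W)."""
    def __init__(self, t, c):
        super().__init__()
        self.frame_mix = nn.Linear(t, t)      # mixes across frames
        self.channel_mix = nn.Linear(c, c)    # mixes across channels
        self.location_mix = nn.Conv2d(c, c, 3, padding=1, groups=c)  # locations

    def forward(self, x):  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        x = x + self.frame_mix(x.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        x = x + self.channel_mix(x.permute(0, 1, 3, 4, 2)).permute(0, 1, 4, 2, 3)
        x = x + self.location_mix(x.reshape(b * t, c, h, w)).view(b, t, c, h, w)
        return x

block = ToyMixerBlock(t=8, c=16)
out = block(torch.randn(2, 8, 16, 8, 8))  # shape is preserved
```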
https://arxiv.org/abs/2311.11683
Video prediction yields future frames from historical frames and has exhibited great potential in many applications, e.g., meteorological prediction and autonomous driving. Previous works often decode the ultimate high-level semantic features to future frames without texture details, which deteriorates the prediction quality. Motivated by this, we develop a Pair-wise Layer Attention (PLA) module to enhance the layer-wise semantic dependency of the feature maps derived from the U-shape structure in the Translator, by coupling low-level visual cues and high-level features. Hence, the texture details of predicted frames are enriched. Moreover, most existing methods capture the spatiotemporal dynamics with the Translator, but fail to sufficiently utilize the spatial features of the Encoder. This inspires us to design a Spatial Masking (SM) module to mask partial encoding features during pretraining, which increases the visibility of the remaining feature pixels to the Decoder. To this end, we present a Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video prediction to capture the spatiotemporal dynamics, which reflect the motion trend. Extensive experiments and rigorous ablation studies on five benchmarks demonstrate the advantages of the proposed approach. The code is available on GitHub.
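The Spatial Masking idea admits a one-function sketch; the mask ratio, shapes, and function name below are assumptions for illustration, not values from the paper:

```python
import torch

def spatial_mask(features, mask_ratio=0.5):
    # Zero out a random subset of spatial positions in the encoder feature
    # map during pretraining; the Decoder then reconstructs from the rest.
    b, c, h, w = features.shape
    keep = (torch.rand(b, 1, h, w, device=features.device) > mask_ratio).float()
    return features * keep

masked = spatial_mask(torch.randn(2, 64, 16, 16), mask_ratio=0.5)
```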
https://arxiv.org/abs/2311.11289
Recently, video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and shot-level videos of varying lengths. Specifically, we propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three assessing criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: this https URL.
https://arxiv.org/abs/2310.20700
We present a novel variational framework for performing inference in (neural) stochastic differential equations (SDEs) driven by Markov-approximate fractional Brownian motion (fBM). SDEs offer a versatile tool for modeling real-world continuous-time dynamic systems with inherent noise and randomness. Combining SDEs with the powerful inference capabilities of variational methods enables the learning of representative function distributions through stochastic gradient descent. However, conventional SDEs typically assume the underlying noise to follow a Brownian motion (BM), which hinders their ability to capture long-term dependencies. In contrast, fractional Brownian motion (fBM) extends BM to encompass non-Markovian dynamics, but existing methods for inferring fBM parameters are either computationally demanding or statistically inefficient. In this paper, building upon the Markov approximation of fBM, we derive the evidence lower bound essential for efficient variational inference of posterior path measures, drawing from the well-established field of stochastic analysis. Additionally, we provide a closed-form expression to determine optimal approximation coefficients. Furthermore, we propose the use of neural networks to learn the drift, diffusion, and control terms within our variational posterior, leading to the variational training of neural-SDEs. In this framework, we also optimize the Hurst index, governing the nature of our fractional noise. Beyond validation on synthetic data, we contribute a novel architecture for variational latent video prediction, an approach that, to the best of our knowledge, enables the first application of variational neural-SDEs to video perception.
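For context, the standard Markov approximation of fBM (which, as we understand it, is the construction the paper builds on) represents the approximate process as a weighted sum of Ornstein-Uhlenbeck processes driven by a common Brownian motion:

```latex
\hat{B}^{H}_{t} = \sum_{k=1}^{K} \omega_k \, Y^{(k)}_{t},
\qquad
\mathrm{d}Y^{(k)}_{t} = -\gamma_k \, Y^{(k)}_{t} \,\mathrm{d}t + \mathrm{d}W_{t},
\qquad Y^{(k)}_{0} = 0,
```

with mean-reversion speeds $\gamma_k$ and weights $\omega_k$ chosen so that the mixture of kernels approximates the fBM kernel; the closed-form expression mentioned above would determine the optimal $\omega_k$. The Hurst index $H \in (0, 1)$ governs the roughness of the noise, with $H = 1/2$ recovering standard Brownian motion.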
https://arxiv.org/abs/2310.12975
Connected autonomous vehicles (CAVs) promise to enhance safety, efficiency, and sustainability in urban transportation. However, this is contingent upon a CAV correctly predicting the motion of surrounding agents and planning its own motion safely. Doing so is challenging in complex urban environments due to frequent occlusions and interactions among many agents. One solution is to leverage smart infrastructure to augment a CAV's situational awareness; the present work leverages a recently proposed "Self-Supervised Traffic Advisor" (SSTA) framework of smart sensors that teach themselves to generate and broadcast useful video predictions of road users. In this work, SSTA predictions are modified to predict future occupancy instead of raw video, which reduces the data footprint of broadcast predictions. The resulting predictions are used within a planning framework, demonstrating that this design can effectively aid CAV motion planning. A variety of numerical experiments study the key factors that make SSTA outputs useful for practical CAV planning in crowded urban environments.
https://arxiv.org/abs/2309.07504
A central challenge of video prediction is that the system has to reason about objects' future motions from image frames while simultaneously maintaining the consistency of their appearances across frames. This work introduces an end-to-end trainable two-stream video prediction framework, Motion-Matrix-based Video Prediction (MMVP), to tackle this challenge. Unlike previous methods that usually handle motion prediction and appearance maintenance within the same set of modules, MMVP decouples motion and appearance information by constructing appearance-agnostic motion matrices. The motion matrices represent the temporal similarity of each and every pair of feature patches in the input frames, and are the sole input of the motion prediction module in MMVP. This design improves video prediction in both accuracy and efficiency, and reduces the model size. Results of extensive experiments demonstrate that MMVP outperforms state-of-the-art systems on public datasets by non-negligible margins (about 1 dB in PSNR on UCF Sports) at significantly smaller model sizes (84% of the size or smaller). Please refer to this https URL for the official code and the datasets used in this paper.
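A motion matrix of this kind is simple to write down. The sketch below is our own illustration (the cosine-similarity choice, shapes, and names are assumptions): it scores the temporal similarity of every pair of feature patches between two consecutive frames.

```python
import torch
import torch.nn.functional as F

def motion_matrix(feat_t, feat_t1):
    # feat_*: (C, H, W) feature maps of frames t and t+1 from a shared
    # encoder. Returns an (H*W, H*W) matrix of cosine similarities between
    # every patch at time t and every patch at time t+1.
    c, h, w = feat_t.shape
    a = F.normalize(feat_t.reshape(c, h * w), dim=0)
    b = F.normalize(feat_t1.reshape(c, h * w), dim=0)
    return a.T @ b

m = motion_matrix(torch.randn(64, 8, 8), torch.randn(64, 8, 8))
print(m.shape)  # torch.Size([64, 64])
```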
https://arxiv.org/abs/2308.16154
Video anomaly detection (VAD) is an important but challenging task in computer vision. The main challenge arises from the rarity of training samples covering all anomaly cases. Hence, semi-supervised anomaly detection methods have received more attention, since they focus on modeling normal behavior and detect anomalies by measuring deviations from normal patterns. Despite the impressive advances of these methods in modeling normal motion and appearance, long-term motion modeling has not been effectively explored so far. Inspired by the abilities of the future frame prediction proxy-task, we introduce the task of future video prediction from a single frame as a novel proxy-task for video anomaly detection. This proxy-task alleviates the challenges of previous methods in learning longer motion patterns. Moreover, we replace the initial and future raw frames with their corresponding semantic segmentation maps, which not only makes the method aware of object class but also makes the prediction task less complex for the model. Extensive experiments on the benchmark datasets (ShanghaiTech, UCSD-Ped1, and UCSD-Ped2) show the effectiveness of the method and the superiority of its performance compared to SOTA prediction-based VAD methods.
https://arxiv.org/abs/2308.07783
We address the video prediction task by putting forth a novel model that combines (i) our recently proposed hierarchical residual vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel spatiotemporal PixelCNN (ST-PixelCNN). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the ST-PixelCNN's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with the chief challenges in video prediction. These include learning spatiotemporal information, handling high-dimensional data, combating blurry predictions, and implicitly modeling physical characteristics. Extensive experimental results on the KTH Human Action and Moving-MNIST tasks demonstrate that our model compares favorably against top video prediction techniques in both quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and ST-PixelCNN parameters.
https://arxiv.org/abs/2307.06701
With the increasing adoption of robots across industries, it is crucial to focus on developing advanced algorithms that enable robots to anticipate, comprehend, and plan their actions effectively in collaboration with humans. We introduce the Robot Autonomous Motion (RoAM) video dataset, which is collected with a custom-made turtlebot3 Burger robot in a variety of indoor environments, recording various human motions from the robot's ego-vision. The dataset also includes synchronized records of the LiDAR scan and all control actions taken by the robot as it navigates around static and moving human agents. This unique dataset provides an opportunity to develop and benchmark new visual prediction frameworks that can predict future image frames based on the actions taken by the recording agent, in partially observable scenarios or cases where the imaging sensor is mounted on a moving platform. We have benchmarked the dataset on our novel deep visual prediction framework, ACPNet, where the approximated future image frames are also conditioned on the actions taken by the robot, and demonstrated its potential for incorporating robot dynamics into the video prediction paradigm for mobile robotics and autonomous navigation research.
https://arxiv.org/abs/2306.15852
General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recent years, benchmarks to test their performance typically do not require an understanding that objects have individual physical properties, or at best test only those properties that are directly observable (e.g., size or color). This work proposes a novel dataset and benchmark, termed Physion++, that rigorously evaluates visual physical prediction in artificial systems under circumstances where those predictions rely on accurate estimates of the latent physical properties of objects in the scene. Specifically, we test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability, and where the values of those properties can only be inferred by observing how objects move and interact with other objects or fluids. We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions. We find that models that have been trained using standard regimes and datasets do not spontaneously learn to make inferences about latent properties, but also that models that encode objectness and physical states tend to make better predictions. However, there is still a huge gap between all models and human performance, and all models' predictions correlate poorly with those made by humans, suggesting that no state-of-the-art model is learning to make physical predictions in a human-like way. Project page: this https URL
https://arxiv.org/abs/2306.15668
In recent years, deep learning-based solar forecasting using all-sky images has emerged as a promising approach for alleviating uncertainty in PV power generation. However, the stochastic nature of cloud movement remains a major challenge for accurate and reliable solar forecasting. With the recent advances in generative artificial intelligence, the synthesis of visually plausible yet diversified sky videos has potential for aiding forecasts. In this study, we introduce SkyGPT, a physics-informed stochastic video prediction model that is able to generate multiple possible future images of the sky with diverse cloud motion patterns, using past sky image sequences as input. Extensive experiments and comparison with benchmark video prediction models demonstrate the effectiveness of the proposed model in capturing cloud dynamics and generating future sky images with high realism and diversity. Furthermore, we feed the generated future sky images from the video prediction models into 15-minute-ahead probabilistic solar forecasting for a 30-kW roof-top PV system, and compare it with an end-to-end deep learning baseline model, SUNSET, and a smart persistence model. Better PV output prediction reliability and sharpness are observed when using the predicted sky images generated with SkyGPT compared with other benchmark models, achieving a continuous ranked probability score (CRPS) of 2.81 (13% better than SUNSET and 23% better than smart persistence) and a Winkler score of 26.70 on the test set. Although an arbitrary number of futures can be generated from a historical sky image sequence, the results suggest that 10 future scenarios is a good choice, balancing probabilistic solar forecasting performance and computational cost.
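For reference, the CRPS reported above can be estimated directly from a finite set of generated scenarios. Below is a minimal sketch of the standard sample-based estimator (the scenario values are made up for illustration):

```python
import numpy as np

def ensemble_crps(samples, obs):
    # Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| over the ensemble.
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - obs))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

# E.g. 10 hypothetical PV-power scenarios (kW) derived from 10 generated
# sky futures, scored against the observed output.
scenarios = [21.0, 18.5, 22.3, 19.9, 20.4, 23.1, 17.8, 20.0, 21.7, 19.2]
print(ensemble_crps(scenarios, obs=20.5))
```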
https://arxiv.org/abs/2306.11682