Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks. We will release models and code at this https URL.
https://arxiv.org/abs/2511.07732
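To make the chunked flow matching decoder concrete, here is a minimal sketch of how a latent action could be decoded into a chunk of continuous robot actions with conditional flow matching. The network sizes, the 32-dimensional latent, the chunk length of 16, and the Euler sampler are illustrative assumptions, not ViPRA's released implementation.

```python
import torch
import torch.nn as nn

class ChunkedFlowMatchingDecoder(nn.Module):
    """Maps a latent action to a chunk of continuous robot actions via flow matching.

    Illustrative sizes only; the real ViPRA decoder may differ.
    """

    def __init__(self, latent_dim=32, action_dim=7, chunk_len=16, hidden=256):
        super().__init__()
        self.action_dim, self.chunk_len = action_dim, chunk_len
        in_dim = chunk_len * action_dim + latent_dim + 1  # noisy chunk + latent action + time
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, chunk_len * action_dim),
        )

    def velocity(self, x_t, t, z):
        # x_t: (B, chunk_len*action_dim), t: (B, 1), z: (B, latent_dim)
        return self.net(torch.cat([x_t, z, t], dim=-1))

    def loss(self, actions, z):
        # actions: (B, chunk_len, action_dim) ground-truth chunk from the few teleop demos
        x1 = actions.flatten(1)
        x0 = torch.randn_like(x1)
        t = torch.rand(x1.size(0), 1)
        x_t = (1 - t) * x0 + t * x1             # linear interpolation path
        target = x1 - x0                        # constant velocity along the path
        return nn.functional.mse_loss(self.velocity(x_t, t, z), target)

    @torch.no_grad()
    def sample(self, z, steps=10):
        # Euler integration from noise to an action chunk, conditioned on latent action z.
        x = torch.randn(z.size(0), self.chunk_len * self.action_dim)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((z.size(0), 1), i * dt)
            x = x + dt * self.velocity(x, t, z)
        return x.view(-1, self.chunk_len, self.action_dim)

decoder = ChunkedFlowMatchingDecoder()
z = torch.randn(4, 32)                          # latent actions from the pretrained video model
print(decoder.loss(torch.randn(4, 16, 7), z).item())
print(decoder.sample(z).shape)                  # torch.Size([4, 16, 7])
```

Because the decoder emits an entire chunk per call, a short chunk at a modest model size is one plausible way such a pipeline could reach the 20+ Hz control rates the abstract mentions.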
Behavior cloning methods for robot learning suffer from poor generalization due to limited data support beyond expert demonstrations. Recent approaches leveraging video prediction models have shown promising results by learning rich spatiotemporal representations from large-scale datasets. However, these models learn action-agnostic dynamics that cannot distinguish between different control inputs, limiting their utility for precise manipulation tasks and requiring large pretraining datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization. Empirical validation demonstrates generalization performance superior to baseline methods on real-world robotic manipulation tasks, showing particular robustness in out-of-distribution (OOD) scenarios including visual distractions and lighting variations.
https://arxiv.org/abs/2510.27114
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
https://arxiv.org/abs/2510.24261
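As a point of reference for the triplane representation DynaRend builds on, the sketch below shows the standard way a 3D query point is featurized from three axis-aligned feature planes (project onto XY/XZ/YZ, bilinearly sample, and fuse). The resolution, channel count, and sum-fusion are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, pts):
    """Sample triplane features at 3D points.

    planes: dict with 'xy', 'xz', 'yz' tensors of shape (B, C, R, R)
    pts:    (B, N, 3) points with coordinates normalized to [-1, 1]
    returns (B, N, C) fused features (summed over the three planes)
    """
    coords = {
        "xy": pts[..., [0, 1]],
        "xz": pts[..., [0, 2]],
        "yz": pts[..., [1, 2]],
    }
    feats = 0
    for name, plane in planes.items():
        grid = coords[name].unsqueeze(2)                      # (B, N, 1, 2)
        sampled = F.grid_sample(plane, grid, mode="bilinear",
                                align_corners=True)           # (B, C, N, 1)
        feats = feats + sampled.squeeze(-1).permute(0, 2, 1)  # (B, N, C)
    return feats

B, C, R = 2, 32, 64
planes = {k: torch.randn(B, C, R, R) for k in ("xy", "xz", "yz")}
pts = torch.rand(B, 1024, 3) * 2 - 1
print(query_triplane(planes, pts).shape)  # torch.Size([2, 1024, 32])
```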
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that operates on continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful for accurate estimation of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.
https://arxiv.org/abs/2510.20807
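The sketch below is a minimal illustration of the kind of pure-transformer autoregressive predictor over continuous pixel-space representations described above, using a deliberately coarse one-token-per-frame layout; the paper itself compares finer spatiotemporal attention layouts, and all sizes here are made-up assumptions.

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Causal transformer that regresses the next frame from past frames.

    Each frame is treated as one token built from continuous pixels
    (a coarse layout; the paper studies finer spatiotemporal ones).
    """

    def __init__(self, frame_dim=3 * 64 * 64, d_model=512, n_layers=4, n_heads=8, max_t=32):
        super().__init__()
        self.embed = nn.Linear(frame_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_t, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, frame_dim)

    def forward(self, frames):
        # frames: (B, T, C*H*W) flattened continuous pixels
        B, T, _ = frames.shape
        x = self.embed(frames) + self.pos[:, :T]
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.encoder(x, mask=mask)
        return self.head(h)                      # position t predicts frame t+1

model = FramePredictor()
video = torch.randn(2, 8, 3 * 64 * 64)
pred = model(video)
loss = nn.functional.mse_loss(pred[:, :-1], video[:, 1:])  # next-frame regression
print(loss.item())
```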
Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
https://arxiv.org/abs/2510.19716
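A contraction penalty in the spirit of the Lyapunov-based stability regularizer could look like the sketch below, which penalizes any increase of a Lyapunov candidate V along a latent rollout. The quadratic V, the margin, and the shapes are placeholders, not LyTimeT's actual formulation.

```python
import torch
import torch.nn as nn

def lyapunov_regularizer(z_traj, V, margin=0.0):
    """Penalize increases of a Lyapunov candidate V along a latent rollout.

    z_traj: (B, T, D) latent states produced by the learned transition model
    V:      module mapping (B, D) -> (B, 1), a learned (or fixed) energy
    Encourages V(z_{t+1}) <= V(z_t) - margin, i.e. contraction during roll-outs.
    """
    v = V(z_traj.flatten(0, 1)).view(z_traj.size(0), z_traj.size(1))
    increase = v[:, 1:] - v[:, :-1] + margin
    return torch.relu(increase).mean()

# Placeholder energy: squared norm of the latent state.
class QuadraticV(nn.Module):
    def forward(self, z):
        return (z ** 2).sum(dim=-1, keepdim=True)

z_traj = torch.randn(4, 10, 16, requires_grad=True)
reg = lyapunov_regularizer(z_traj, QuadraticV(), margin=0.01)
reg.backward()
print(reg.item())
```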
Script event induction, which aims to predict the subsequent event based on the context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented in the form of videos rather than scripts, yet there is a lack of related research in the realm of vision. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, consisting of about 35K annotated videos and more than 178K event video clips, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well-suited for AVEP. We propose EventFormer, a node-graph hierarchical attention based video event prediction model that can capture both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments using several SOTA video prediction models as well as LVLMs on AVEP, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all of these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
https://arxiv.org/abs/2510.21786
Predicting precipitation maps is a highly complex spatiotemporal modeling task, critical for mitigating the impacts of extreme weather events. Short-term precipitation forecasting, or nowcasting, requires models that are not only accurate but also computationally efficient for real-time applications. Current methods, such as token-based autoregressive models, often suffer from flawed inductive biases and slow inference, while diffusion models can be computationally intensive. To address these limitations, we introduce BlockGPT, a generative autoregressive transformer that uses a batched tokenization (Block) method to predict full two-dimensional fields (frames) at each time step. Conceived as a model-agnostic paradigm for video prediction, BlockGPT factorizes space-time by using self-attention within each frame and causal attention across frames; in this work, we instantiate it for precipitation nowcasting. We evaluate BlockGPT on two precipitation datasets, viz. KNMI (Netherlands) and SEVIR (U.S.), comparing it to state-of-the-art baselines including token-based (NowcastingGPT) and diffusion-based (DiffCast+Phydnet) models. The results show that BlockGPT achieves superior accuracy and event localization as measured by categorical metrics, with inference up to 31x faster than comparable baselines.
https://arxiv.org/abs/2510.06293
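The space-time factorization BlockGPT describes (full self-attention within a frame, causal attention across frames) corresponds to a block-causal attention mask; a minimal construction is sketched below, with token counts chosen only for illustration.

```python
import torch

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean attention mask for block-causal space-time factorization.

    Tokens within the same frame attend to each other fully; across frames,
    a token may only attend to tokens from the same or earlier frames.
    True marks blocked positions, matching torch.nn.MultiheadAttention's
    boolean attn_mask convention.
    """
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # blocked if the key's frame comes after the query's frame
    return frame_id.unsqueeze(1) < frame_id.unsqueeze(0)

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
```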
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4x faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
https://arxiv.org/abs/2509.15536
We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles -- akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.
https://arxiv.org/abs/2509.09737
In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions and for enabling robotic planning. However, existing paradigms fall short of jointly modeling these aspects. Vision-Language-Action (VLA) models focus on action prediction but lack explicit modeling of how actions influence the visual scene, while video prediction models generate future frames without conditioning on specific actions, often resulting in implausible or contextually inconsistent outcomes. To bridge this gap, we propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios, conditioned on hand trajectories. In the first stage, we perform consecutive state modeling to process heterogeneous inputs (visual observations, language, and action history) and explicitly predict future hand trajectories. In the second stage, we introduce causal cross-attention to fuse multi-modal cues, leveraging inferred action signals to guide an image-based Latent Diffusion Model (LDM) for frame-by-frame future video generation. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks, providing explicit predictions of both upcoming actions and their visual consequences. Extensive experiments on Ego4D, BridgeData, and RLBench demonstrate that our method outperforms state-of-the-art baselines in both action prediction and future video synthesis.
https://arxiv.org/abs/2508.19852
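A hedged sketch of the second-stage fusion step described above: frame latents act as queries that cross-attend to the predicted hand-trajectory tokens under a causal constraint, so a frame can only see trajectory cues up to its own time step. The module, dimensions, and step bookkeeping are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TrajectoryCrossAttention(nn.Module):
    """Frame latents (queries) attend to predicted hand-trajectory tokens (keys/values)."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frame_tokens, traj_tokens, traj_step_ids, frame_step_ids):
        # Causal constraint: a frame token at step t may only look at trajectory
        # tokens predicted for steps <= t (True = blocked in the attn_mask).
        blocked = traj_step_ids.unsqueeze(0) > frame_step_ids.unsqueeze(1)  # (Lq, Lk)
        fused, _ = self.attn(frame_tokens, traj_tokens, traj_tokens, attn_mask=blocked)
        return self.norm(frame_tokens + fused)

fuse = TrajectoryCrossAttention()
frames = torch.randn(2, 16, 256)          # 16 frame latents
traj = torch.randn(2, 8, 256)             # 8 predicted trajectory tokens
out = fuse(frames, traj,
           traj_step_ids=torch.arange(8),
           frame_step_ids=torch.arange(16) // 2)   # two frame tokens per step
print(out.shape)  # torch.Size([2, 16, 256])
```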
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.
https://arxiv.org/abs/2508.09822
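As a rough illustration of PAR's physical tokens, the toy model below interleaves continuous frame and action embeddings into one sequence and processes them with a causal transformer, so frame positions predict the next action token and action positions predict the next frame token. The real model's DiT-based de-tokenizer, inverse-kinematics mask, and KV-cache are omitted, and all shapes are invented.

```python
import torch
import torch.nn as nn

class PhysicalAR(nn.Module):
    """Toy autoregressive model over interleaved frame/action embeddings."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, n_layers)
        self.to_action = nn.Linear(d_model, d_model)   # continuous action token head
        self.to_frame = nn.Linear(d_model, d_model)    # continuous frame token head

    def forward(self, frame_emb, action_emb):
        # Interleave: f_0, a_0, f_1, a_1, ...  -> (B, 2T, D)
        B, T, D = frame_emb.shape
        seq = torch.stack([frame_emb, action_emb], dim=2).reshape(B, 2 * T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T)
        h = self.core(seq, mask=mask)
        # Frame positions (even indices) predict the next action token,
        # action positions (odd indices) predict the next frame token.
        return self.to_action(h[:, 0::2]), self.to_frame(h[:, 1::2])

model = PhysicalAR()
frames, actions = torch.randn(2, 6, 256), torch.randn(2, 6, 256)
next_actions, next_frames = model(frames, actions)
print(next_actions.shape, next_frames.shape)  # (2, 6, 256) each
```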
Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models for complex dynamical settings; however, their model ViPro assumed a given ground-truth initial symbolic state. We show that this approach led the model to learn a shortcut that does not actually connect the observed environment with the predicted symbolic state, resulting in an inability to estimate states from an observation when previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without being given a full ground-truth state at the beginning. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.
https://arxiv.org/abs/2508.06335
We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
https://arxiv.org/abs/2507.19468
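A minimal sketch of how an action-conditioned latent world model of this kind can be used for planning: sample candidate action sequences, roll each out with the latent predictor, and keep the sequence whose final latent lands closest to a goal latent. The random-shooting scheme, the goal metric, and the toy linear predictor below are assumptions, not what the DINO-world paper prescribes.

```python
import torch

def plan_by_shooting(predict, z0, z_goal, num_candidates=256, horizon=8, action_dim=7):
    """Random-shooting planner in latent space.

    predict(z, a) -> next latent, applied step by step for each candidate sequence.
    Returns the first action of the best-scoring sequence.
    """
    actions = torch.randn(num_candidates, horizon, action_dim)
    z = z0.expand(num_candidates, -1).clone()
    for t in range(horizon):
        z = predict(z, actions[:, t])
    scores = -((z - z_goal) ** 2).sum(dim=-1)       # closer to goal latent = better
    best = scores.argmax()
    return actions[best, 0]

# Placeholder latent dynamics: a random linear map (stands in for the fine-tuned predictor).
D, A = 64, 7
W_z, W_a = torch.randn(D, D) * 0.1, torch.randn(A, D) * 0.1
predict = lambda z, a: torch.tanh(z @ W_z + a @ W_a)

a0 = plan_by_shooting(predict, z0=torch.randn(1, D), z_goal=torch.randn(1, D))
print(a0.shape)  # torch.Size([7])
```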
Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
https://arxiv.org/abs/2507.14809
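For orientation, the snippet below shows the stock InstructPix2Pix interface in the diffusers library (public timbrooks/instruct-pix2pix checkpoint): a single image plus a text instruction in, a single predicted image out. The paper fine-tunes this kind of model on robot data to predict a frame roughly 10 seconds ahead; the checkpoint, file names, and prompt here are placeholders, not the authors' weights or data.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Load the public InstructPix2Pix checkpoint (the paper fine-tunes this kind of model
# on robot data; here we only illustrate the image+text -> image interface).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

current_view = Image.open("current_robot_view.png").convert("RGB")   # hypothetical file
instruction = "pick up the red block and place it in the tray"       # hypothetical prompt

# One forward pass predicts a single future frame instead of rolling out many
# intermediate frames as a conventional video prediction model would.
future_view = pipe(
    instruction,
    image=current_view,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
future_view.save("predicted_future_view.png")
```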
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments. Our models and code are publicly available at this https URL.
https://arxiv.org/abs/2507.11296
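The unidirectional attention described in the abstract above can be expressed as a simple attention mask over grouped tokens, sketched below under the assumption that the sequence is laid out as [action tokens | video tokens]: action tokens never see video tokens, so the video branch can be dropped at inference without changing the action output, while video tokens may still attend to action tokens.

```python
import torch

def unidirectional_mask(n_action, n_video):
    """Boolean mask (True = blocked) over [action tokens | video tokens].

    Action tokens never attend to video tokens, so dropping the video branch at
    inference does not change the action predictions. Video tokens may attend
    to action tokens, letting video prediction be conditioned on the action.
    """
    n = n_action + n_video
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_action, n_action:] = True   # block action -> video attention
    return mask

print(unidirectional_mask(n_action=2, n_video=3).int())
# [[0, 0, 1, 1, 1],
#  [0, 0, 1, 1, 1],
#  [0, 0, 0, 0, 0],
#  [0, 0, 0, 0, 0],
#  [0, 0, 0, 0, 0]]
```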
Spatio-temporal video prediction plays a pivotal role in critical domains, ranging from weather forecasting to industrial automation. However, in high-precision industrial scenarios such as semiconductor manufacturing, the absence of specialized benchmark datasets severely hampers research on modeling and predicting complex processes. To address this challenge, we make a twofold contribution. First, we construct and release the Chip Dicing Lane Dataset (CHDL), the first public temporal image dataset dedicated to the semiconductor wafer dicing process. Captured via an industrial-grade vision system, CHDL provides a much-needed and challenging benchmark for high-fidelity process modeling, defect detection, and digital twins. Second, we propose DIFFUMA, an innovative dual-path prediction architecture specifically designed for such fine-grained dynamics. The model captures global long-range temporal context through a parallel Mamba module, while simultaneously leveraging a diffusion module, guided by temporal features, to restore and enhance fine-grained spatial details, effectively combating feature degradation. Experiments demonstrate that on our CHDL benchmark, DIFFUMA significantly outperforms existing methods, reducing the Mean Squared Error (MSE) by 39% and improving the Structural Similarity (SSIM) from 0.926 to a near-perfect 0.988. This superior performance also generalizes to natural phenomena datasets. Our work not only delivers a new state-of-the-art (SOTA) model but, more importantly, provides the community with an invaluable data resource to drive future research in industrial AI.
https://arxiv.org/abs/2507.06738
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at this https URL.
https://arxiv.org/abs/2506.24113
We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
https://arxiv.org/abs/2506.21552
Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics by integrating simulation, prediction, and manipulation. However, their practical application remains limited due to (1) slow generation speed, which limits real-time interaction, and (2) poor consistency between imagined videos and executable actions. To address these challenges, we propose Manipulate in Dream (MinD), a hierarchical diffusion-based world model framework that employs a dual-system design for vision-language manipulation. MinD executes the VGM at low frequencies to extract video prediction features, while leveraging a high-frequency diffusion policy for real-time interaction. This architecture enables low-latency, closed-loop control in manipulation with coherent visual guidance. To better coordinate the two systems, we introduce a video-action diffusion matching module (DiffMatcher), with a novel co-training strategy that uses separate schedulers for each diffusion model. Specifically, we introduce a diffusion-forcing mechanism to DiffMatcher that aligns their intermediate representations during training, helping the fast action model better understand video-based predictions. Beyond manipulation, MinD also functions as a world simulator, reliably predicting task success or failure in latent space before execution. Trustworthy analysis further shows that VGMs can preemptively evaluate task feasibility and mitigate risks. Extensive experiments across multiple benchmarks demonstrate that MinD achieves state-of-the-art manipulation performance (63%+) on RL-Bench, advancing the frontier of unified world modeling in robotics.
https://arxiv.org/abs/2506.18897
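A minimal sketch of the dual-system idea from MinD's abstract: an expensive video-prediction call refreshes guidance features only every K control steps, while a cheap policy acts at every step against the latest guidance. The modules, rates, and environment below are placeholders.

```python
import torch

def dual_frequency_control(env_step, slow_world_model, fast_policy,
                           obs, total_steps=100, slow_every=10):
    """Run a fast policy at every step, refreshing slow world-model features
    only every `slow_every` steps (low-frequency video prediction features
    guiding a high-frequency action policy)."""
    guidance = slow_world_model(obs)                     # expensive call
    for t in range(total_steps):
        if t > 0 and t % slow_every == 0:
            guidance = slow_world_model(obs)             # refresh visual guidance
        action = fast_policy(obs, guidance)              # cheap call, every step
        obs = env_step(action)
    return obs

# Placeholder modules: random features, a zero policy, and a toy environment.
slow_world_model = lambda obs: torch.randn(32)
fast_policy = lambda obs, guidance: torch.zeros(7)
env_step = lambda action: torch.randn(64)

final_obs = dual_frequency_control(env_step, slow_world_model, fast_policy, obs=torch.randn(64))
print(final_obs.shape)
```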
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at this https URL.
https://arxiv.org/abs/2506.14198
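A hedged sketch of the forward/inverse split AMPLIFY describes: a forward model trained on action-free video predicts discrete motion tokens (standing in for quantized keypoint trajectories), and an inverse model trained on a small action-labeled set maps those tokens plus the current observation to robot actions. The architectures, vocabulary size, and shapes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ForwardDynamics(nn.Module):
    """Predicts discrete motion tokens (quantized keypoint trajectories) from frames.
    Trainable on action-free video alone."""
    def __init__(self, obs_dim=512, vocab=256, horizon=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(),
                                 nn.Linear(512, horizon * vocab))
        self.horizon, self.vocab = horizon, vocab

    def forward(self, obs_feat):
        return self.net(obs_feat).view(-1, self.horizon, self.vocab)   # token logits

class InverseDynamics(nn.Module):
    """Maps predicted motion tokens plus the current observation to robot actions.
    Trainable on a small action-labeled set."""
    def __init__(self, obs_dim=512, vocab=256, horizon=8, action_dim=7):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, 64)
        self.net = nn.Sequential(nn.Linear(obs_dim + horizon * 64, 512), nn.ReLU(),
                                 nn.Linear(512, horizon * action_dim))
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, obs_feat, motion_tokens):
        emb = self.token_emb(motion_tokens).flatten(1)
        return self.net(torch.cat([obs_feat, emb], dim=-1)).view(-1, self.horizon, self.action_dim)

fwd, inv = ForwardDynamics(), InverseDynamics()
obs = torch.randn(2, 512)
motion_tokens = fwd(obs).argmax(dim=-1)             # (2, 8) discrete motion tokens
actions = inv(obs, motion_tokens)
print(actions.shape)                                # torch.Size([2, 8, 7])
```

Because only the inverse model needs action labels, the two parts can be scaled independently, which is the decoupling the abstract emphasizes.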