Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.
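To make the second stage concrete, here is a minimal sketch of one common way a predicted HOI mask can act as structural guidance for a video diffusion denoiser: channel-concatenating it with the noisy video latent. The module and tensor names (Denoiser3D, hoi_mask, and so on) are illustrative placeholders, not EgoExo-Gen's actual architecture.

```python
import torch
import torch.nn as nn

# Hedged sketch: feed the HOI mask as an extra conditioning channel of a
# stand-in 3D denoiser. Names and shapes are assumptions for illustration.

class Denoiser3D(nn.Module):
    """Stand-in for a 3D UNet-style denoiser over (B, C, T, H, W) latents."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, t_emb):
        # A real denoiser would inject the timestep embedding properly;
        # here it is simply broadcast-added.
        return self.net(x) + t_emb.view(-1, 1, 1, 1, 1)

B, C, T, H, W = 2, 4, 8, 32, 32
noisy_latent = torch.randn(B, C, T, H, W)   # diffused ego-video latent
hoi_mask = torch.rand(B, 1, T, H, W)        # predicted hand-object masks in [0, 1]
t_emb = torch.randn(B)                      # toy timestep embedding

denoiser = Denoiser3D(in_ch=C + 1, out_ch=C)
cond_input = torch.cat([noisy_latent, hoi_mask], dim=1)  # add guidance channel
noise_pred = denoiser(cond_input, t_emb)
print(noise_pred.shape)  # torch.Size([2, 4, 8, 32, 32])
```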
https://arxiv.org/abs/2504.11732
Video prediction (VP) generates future frames by leveraging spatial representations and temporal context from past frames. Traditional recurrent neural network (RNN)-based models enhance memory cell structures to capture spatiotemporal states over extended durations but suffer from gradual loss of object appearance details. To address this issue, we propose the strong recollection VP (SRVP) model, which integrates standard attention (SA) and reinforced feature attention (RFA) modules. Both modules employ scaled dot-product attention to extract temporal context and spatial correlations, which are then fused to enhance spatiotemporal representations. Experiments on three benchmark datasets demonstrate that SRVP mitigates image quality degradation in RNN-based models while achieving predictive performance comparable to RNN-free architectures.
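To picture the mechanism, the sketch below applies scaled dot-product attention along the temporal and spatial axes and fuses the two outputs by addition, which is the generic operation the SA and RFA modules are described as building on; it is not the SRVP architecture itself, and all shapes are arbitrary.

```python
import torch

# Minimal sketch: temporal + spatial scaled dot-product attention, fused by
# addition. Illustrative only; not the SRVP model.

def sdp_attention(q, k, v):
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

B, T, N, D = 2, 10, 64, 32            # batch, frames, spatial tokens per frame, dim
x = torch.randn(B, T, N, D)

# Temporal attention: attend across frames for each spatial location.
xt = x.permute(0, 2, 1, 3)                        # (B, N, T, D)
temporal = sdp_attention(xt, xt, xt).permute(0, 2, 1, 3)

# Spatial attention: attend across locations within each frame.
spatial = sdp_attention(x, x, x)                  # (B, T, N, D)

fused = temporal + spatial                        # fused spatio-temporal features
print(fused.shape)                                # torch.Size([2, 10, 64, 32])
```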
https://arxiv.org/abs/2504.08012
Model Predictive Control (MPC) is a widely adopted control paradigm that leverages predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels in planning and control, it lacks the capability for environmental perception, leading to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that integrates the perception power of vision-language models (VLMs) with MPC. VLMPC utilizes a conditional action sampling module that takes a goal image or language instruction as input and leverages VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames based on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics conditioned on the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the optimal action sequence using a VLM-based hierarchical cost function that captures both pixel-level and knowledge-level consistency between the current observation and the task input. We demonstrate that both approaches outperform existing state-of-the-art methods on public benchmarks and achieve excellent performance in various real-world robotic manipulation tasks. Code is available at this https URL.
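The control loop itself follows the usual sampling-based MPC pattern. The sketch below shows that loop with stub components standing in for the VLM-conditioned action sampler, the video/trajectory prediction model, and the hierarchical cost; every component name and shape here is an assumption for illustration, not VLMPC's implementation.

```python
import numpy as np

# Sampling-based MPC loop with stand-in components. Only the loop structure is
# the point; the stubs are assumptions, not the paper's modules.

def sample_candidate_actions(num_candidates, horizon, action_dim, rng):
    # Stand-in for VLM-conditioned sampling from a goal image / instruction.
    return rng.normal(size=(num_candidates, horizon, action_dim))

def rollout_model(obs, actions):
    # Stand-in for the video prediction or motion-trajectory model.
    return obs + actions.sum(axis=0)

def cost_fn(predicted_obs, goal_obs):
    # Stand-in for the pixel-level + knowledge-level hierarchical cost.
    return float(np.linalg.norm(predicted_obs - goal_obs))

rng = np.random.default_rng(0)
obs = np.zeros(3)
goal = np.array([1.0, -0.5, 0.2])

for step in range(5):                               # receding-horizon loop
    candidates = sample_candidate_actions(16, horizon=4, action_dim=3, rng=rng)
    costs = [cost_fn(rollout_model(obs, a), goal) for a in candidates]
    best = candidates[int(np.argmin(costs))]
    obs = obs + best[0]                             # execute only the first action
    print(f"step {step}: best cost = {min(costs):.3f}")
```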
https://arxiv.org/abs/2504.05225
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer -- we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. Across the three scenarios (DoRA, VideoMAE, future prediction), our orthogonal optimizer outperforms the strong AdamW baseline.
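One plain way to realize the orthogonal-gradient idea is to project out of the current gradient its component along the previous batch's gradient before the optimizer step, as in the sketch below on top of SGD; the exact projection and bookkeeping used in the paper may differ.

```python
import torch

# Hedged sketch: decorrelate consecutive streaming batches by removing from the
# current gradient its component along the previous gradient, then stepping SGD.

def orthogonalize_(params, prev_grads, eps=1e-12):
    for p, g_prev in zip(params, prev_grads):
        if p.grad is None or g_prev is None:
            continue
        g = p.grad.reshape(-1)
        g_prev_flat = g_prev.reshape(-1)
        coef = torch.dot(g, g_prev_flat) / (g_prev_flat.norm() ** 2 + eps)
        p.grad -= coef * g_prev          # g <- g - proj_{g_prev}(g)

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
prev_grads = [None for _ in model.parameters()]

for step in range(3):                    # sequential, correlated "stream" batches
    x = torch.randn(4, 10) + step        # drifting inputs mimic a video stream
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    orthogonalize_(list(model.parameters()), prev_grads)
    prev_grads = [p.grad.detach().clone() for p in model.parameters()]
    opt.step()
```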
https://arxiv.org/abs/2504.01961
Transmission latency significantly affects users' quality of experience in real-time interaction and actuation. As latency is largely unavoidable, video prediction can be utilized to mask it and ultimately enable zero-latency transmission. However, most of the existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose IFRVP (Intermediate Feature Refinement Video Prediction), a real-time video prediction approach targeting zero-latency interaction over networks. Firstly, we propose three training methods for video prediction that extend frame interpolation models, where we utilize a simple convolution-only frame interpolation network based on IFRNet. Secondly, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed among the existing video prediction methods. A demonstration movie is also provided at this http URL.
https://arxiv.org/abs/2503.23185
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.
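One way to picture attention over point tracks is to sample per-frame features at the tracked locations and then attend along each track's time axis, as in the hedged sketch below; the sampling scheme, shapes, and the single attention step are assumptions for illustration rather than the Tracktention Layer's actual design.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: gather features at tracked points, then attend along
# time within each track so corresponding points stay aligned across frames.

B, T, P, C, H, W = 1, 6, 50, 16, 32, 32
feats = torch.randn(B, T, C, H, W)               # per-frame feature maps
tracks = torch.rand(B, T, P, 2) * 2 - 1          # (x, y) track points in [-1, 1]

# Gather features at track locations for every frame.
sampled = []
for t in range(T):
    grid = tracks[:, t].reshape(B, 1, P, 2)                     # (B, 1, P, 2)
    s = F.grid_sample(feats[:, t], grid, align_corners=False)   # (B, C, 1, P)
    sampled.append(s.squeeze(2).transpose(1, 2))                # (B, P, C)
track_feats = torch.stack(sampled, dim=2)                       # (B, P, T, C)

# Attention along time within each track.
scale = C ** -0.5
attn = torch.softmax(track_feats @ track_feats.transpose(-2, -1) * scale, dim=-1)
aligned = attn @ track_feats                                    # (B, P, T, C)
print(aligned.shape)
```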
https://arxiv.org/abs/2503.19904
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
https://arxiv.org/abs/2503.18945
Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.
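The cross-modal information sharing can be pictured as cross-attention in which one modality's tokens query the other's, roughly as in the sketch below; the layer, shapes, and residual fusion are illustrative assumptions, not SyncVP's actual spatio-temporal cross-attention block.

```python
import torch
import torch.nn as nn

# Minimal sketch of cross-attention between two modality token streams
# (e.g. RGB latents attending to depth latents). Assumed shapes and fusion.

class CrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x_query, x_context):
        # Queries come from one modality, keys/values from the other.
        out, _ = self.attn(query=x_query, key=x_context, value=x_context)
        return x_query + out                       # residual fusion

B, T, N, D = 2, 8, 64, 128
rgb_tokens = torch.randn(B, T * N, D)              # flattened spatio-temporal tokens
depth_tokens = torch.randn(B, T * N, D)

fused = CrossAttention(D)(rgb_tokens, depth_tokens)
print(fused.shape)   # torch.Size([2, 512, 128])
```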
https://arxiv.org/abs/2503.18933
Ensuring the safety and well-being of elderly and vulnerable populations in assisted living environments is a critical concern. Computer vision presents an innovative and powerful approach to predicting health risks through video monitoring, employing human action recognition (HAR) technology. However, real-time prediction of human actions with high performance and efficiency is a challenge. This research proposes a real-time human action recognition model that combines a deep learning model and a live video prediction and alert system, in order to predict falls, staggering and chest pain for residents in assisted living. Six thousand RGB video samples from the NTU RGB+D 60 dataset were selected to create a dataset with four classes: Falling, Staggering, Chest Pain, and Normal, with the Normal class comprising 40 daily activities. A transfer learning technique was applied to train four state-of-the-art HAR models on a GPU server, namely, UniFormerV2, TimeSformer, I3D, and SlowFast. Results of the four models are presented in this paper based on class-wise and macro performance metrics, inference efficiency, model complexity and computational costs. TimeSformer is proposed for developing the real-time human action recognition model, leveraging its leading macro F1 score (95.33%), recall (95.49%), and precision (95.19%) along with significantly higher inference throughput compared to the others. This research provides insights to enhance safety and health of the elderly and people with chronic illnesses in assisted living environments, fostering sustainable care, smarter communities and industry innovation.
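For reference, class-wise and macro metrics of the kind reported above can be computed as in the snippet below for the four-class setup (Falling, Staggering, Chest Pain, Normal); the labels are toy placeholders, not the paper's predictions or results.

```python
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

# Toy labels only, to show how macro F1 / recall / precision are obtained.
classes = ["Falling", "Staggering", "Chest Pain", "Normal"]
y_true = [0, 1, 2, 3, 3, 0, 1, 2, 3, 3]
y_pred = [0, 1, 2, 3, 3, 0, 2, 2, 3, 1]

print(classification_report(y_true, y_pred, target_names=classes, zero_division=0))
print("macro F1:       ", f1_score(y_true, y_pred, average="macro"))
print("macro recall:   ", recall_score(y_true, y_pred, average="macro"))
print("macro precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
```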
https://arxiv.org/abs/2503.18957
Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice, TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which act as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at this https URL.
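A rough sketch of the frame-wise conditioning idea: expand a single sentence embedding into one embedding per output frame, which can then be injected into the T2V model as an extra per-frame condition. The learned per-frame offset plus MLP used here is an assumed stand-in, not the FCA sub-module itself.

```python
import torch
import torch.nn as nn

# Hedged sketch: turn one text embedding into T per-frame text conditions.
class FramewiseTextCondition(nn.Module):
    def __init__(self, dim: int, num_frames: int):
        super().__init__()
        self.frame_pos = nn.Parameter(torch.zeros(num_frames, dim))
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_emb):                     # text_emb: (B, D)
        x = text_emb[:, None, :] + self.frame_pos    # (B, T, D)
        return self.proj(x)                          # one text condition per frame

B, D, T = 2, 256, 16
text_emb = torch.randn(B, D)
frame_conditions = FramewiseTextCondition(D, T)(text_emb)
print(frame_conditions.shape)    # torch.Size([2, 16, 256])
```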
https://arxiv.org/abs/2503.12953
In recent years, weather forecasting has gained significant attention. However, accurately predicting weather remains a challenge due to the rapid variability of meteorological data and potential teleconnections. Current spatiotemporal forecasting models primarily rely on convolution operations or sliding windows for feature extraction. These methods are limited by the size of the convolutional kernel or sliding window, making it difficult to capture and identify potential teleconnection features in meteorological data. Additionally, weather data often involve non-rigid bodies, whose motion processes are accompanied by unpredictable deformations, further complicating the forecasting task. In this paper, we propose the GMG model to address these two core challenges. The Global Focus Module, a key component of our model, enhances the global receptive field, while the Motion Guided Module adapts to the growth or dissipation processes of non-rigid bodies. Through extensive evaluations, our method demonstrates competitive performance across various complex tasks, providing a novel approach to improving the predictive accuracy of complex spatiotemporal data.
https://arxiv.org/abs/2503.11297
Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, i.e., RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentanglement representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.
https://arxiv.org/abs/2503.08751
Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong performance across various tasks and modalities, their application to temporal predictive learning remains underexplored. Existing approaches treat predictive learning as a conditional generation problem, but often fail to fully exploit the temporal dynamics inherent in the data, leading to challenges in generating temporally coherent sequences. To address this, we introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Dynamical Diffusion explicitly models temporal transitions at each diffusion step, establishing dependencies on preceding states to better capture temporal dynamics. Through the reparameterization trick, Dynamical Diffusion achieves efficient training and inference similar to any standard diffusion model. Extensive experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks, filling a crucial gap in existing methodologies. Code is available at this repository: this https URL.
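To make the contrast concrete, the sketch below shows a standard per-frame DDPM-style forward sample next to a generic "temporally aware" variant in which the injected noise is correlated across consecutive frames; this is a toy construction to illustrate the idea of temporal dependence in the forward process, not the actual Dynamical Diffusion formulation.

```python
import torch

# Illustration only: standard forward sample vs. a variant with noise shared
# across time, so diffused states of neighbouring frames depend on each other.

def cosine_alpha_bar(k, K):
    return torch.cos(torch.tensor(k / K) * torch.pi / 2) ** 2

def forward_sample(x0, k, K):
    # Standard reparameterized sample: x_k = sqrt(ab) x_0 + sqrt(1 - ab) eps.
    ab = cosine_alpha_bar(k, K)
    eps = torch.randn_like(x0)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

def forward_sample_temporal(frames, k, K, rho=0.8):
    # Same marginal noise level, but noise is correlated across time via rho.
    ab = cosine_alpha_bar(k, K)
    noisy, eps_prev = [], None
    for x0 in frames:                               # frames: list of (C, H, W) tensors
        eps = torch.randn_like(x0)
        if eps_prev is not None:
            eps = rho * eps_prev + (1 - rho ** 2) ** 0.5 * eps
        noisy.append(ab.sqrt() * x0 + (1 - ab).sqrt() * eps)
        eps_prev = eps
    return noisy

frames = [torch.randn(3, 16, 16) for _ in range(4)]
xk = forward_sample_temporal(frames, k=250, K=1000)
print(len(xk), xk[0].shape, forward_sample(frames[0], 250, 1000).shape)
```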
https://arxiv.org/abs/2503.00951
A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction, and actions provide dynamics information for video prediction. However, effectively combining video generation and action prediction remains challenging, and current video generation-based methods struggle to match the performance of direct policy learning in action accuracy and inference speed. To bridge this gap, we introduce the Unified Video Action model (UVA), which jointly optimizes video and action predictions to achieve both high accuracy and efficient action inference. The key lies in learning a joint video-action latent representation and decoupling video-action decoding. The joint latent representation bridges the visual and action domains, effectively modeling the relationship between video and action sequences. Meanwhile, the decoupled decoding, powered by two lightweight diffusion heads, enables high-speed action inference by bypassing video generation during inference. Such a unified framework further enables versatile functionality through masked input training. By selectively masking actions or videos, a single model can tackle diverse tasks beyond policy learning, such as forward and inverse dynamics modeling and video generation. Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications. Results are best viewed on this https URL.
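The masked-input training idea can be sketched as randomly replacing either the action tokens or the video tokens in the joint sequence with a mask embedding, so one model covers policy-like, dynamics-like, and generation-like objectives; the shapes, probabilities, and token layout below are assumptions for illustration, not UVA's implementation.

```python
import torch

# Hedged sketch of masked-input training over a joint video-action sequence.
B, Tv, Ta, D = 2, 8, 8, 64
video_tokens = torch.randn(B, Tv, D)
action_tokens = torch.randn(B, Ta, D)
mask_embed = torch.zeros(1, 1, D)                  # learned in a real model

def make_training_input(video, actions, p_mask_actions=0.5, p_mask_video=0.25):
    v, a = video.clone(), actions.clone()
    if torch.rand(()) < p_mask_actions:
        a = mask_embed.expand_as(a).clone()        # policy-style target: predict actions
    elif torch.rand(()) < p_mask_video:
        v = mask_embed.expand_as(v).clone()        # generation-style target: predict video
    return torch.cat([v, a], dim=1)                # joint sequence fed to the model

x = make_training_input(video_tokens, action_tokens)
print(x.shape)   # torch.Size([2, 16, 64])
```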
https://arxiv.org/abs/2503.00200
We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.
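The violation-of-expectation protocol reduces to comparing a model's prediction error ("surprise") on a physically plausible clip versus a matched implausible one, crediting the model when the implausible clip is more surprising. The sketch below shows that comparison with a toy predictor and random tensors standing in for learned representations; the scoring rule is the general VoE convention, not this paper's exact pipeline.

```python
import torch

# Toy violation-of-expectation scoring: implausible outcome should be more
# surprising (higher prediction error) than the matched plausible outcome.

def surprise(predictor, context, target):
    with torch.no_grad():
        pred = predictor(context)
        return torch.mean((pred - target) ** 2).item()

predictor = lambda ctx: ctx[-1]        # toy "copy the last representation" predictor

def voe_accuracy(pairs):
    correct = 0
    for context, plausible, implausible in pairs:
        s_p = surprise(predictor, context, plausible)
        s_i = surprise(predictor, context, implausible)
        correct += int(s_i > s_p)
    return correct / len(pairs)

# Toy (context frames, plausible outcome, implausible outcome) triplets in a
# learned representation space.
pairs = [(torch.randn(4, 64), torch.randn(64), torch.randn(64)) for _ in range(10)]
print("VoE accuracy:", voe_accuracy(pairs))
```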
https://arxiv.org/abs/2502.11831
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models mainly build on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore solving this problem by combining a generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task; (2) diffusion-related mask tokens that handle the fuzzy relations between mask reconstruction and the generative diffusion process; and (3) an extension of the mask construction task to the spatial-temporal domain by utilizing row-wise masks for shifted self-attention rather than the masked self-attention in MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on the above improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model contains two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, including standard validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.
https://arxiv.org/abs/2502.11663
Accurate nowcasting of convective clouds from satellite imagery is essential for mitigating the impacts of meteorological disasters, especially in developing countries and remote regions with limited ground-based observations. Recent advances in deep learning have shown promise in video prediction; however, existing models frequently produce blurry results and exhibit reduced accuracy when forecasting physical fields. Here, we introduce SATcast, a diffusion model that leverages a cascade architecture and multimodal inputs for nowcasting cloud fields in satellite imagery. SATcast incorporates physical fields predicted by FuXi, a deep-learning weather model, alongside past satellite observations as conditional inputs to generate high-quality future cloud fields. Through comprehensive evaluation, SATcast outperforms conventional methods on multiple metrics, demonstrating its superior accuracy and robustness. Ablation studies underscore the importance of its multimodal design and the cascade architecture in achieving reliable predictions. Notably, SATcast maintains predictive skill for up to 24 hours, underscoring its potential for operational nowcasting applications.
https://arxiv.org/abs/2502.10957
Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot can generate multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Videos and code are available at this https URL.
https://arxiv.org/abs/2502.07600
We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and evaluation in scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while maintaining computational efficiency to run in real time. HMA uses heterogeneous pre-training from observations and action sequences across different robotic embodiments, domains, and tasks. HMA uses masked autoregression to generate quantized or soft tokens for video predictions. HMA achieves better visual fidelity and controllability than previous robotic video generation models, with 15 times faster speed in the real world. After post-training, this model can be used as a video simulator from low-level action inputs for evaluating policies and generating synthetic data. See this https URL for more information.
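Masked autoregression over video tokens can be pictured as iterative parallel decoding in the MaskGIT style: start from fully masked tokens and unmask the most confident predictions over a few steps. The random toy predictor, vocabulary size, and schedule below are assumptions that illustrate the decoding pattern, not HMA's model.

```python
import torch

# Toy iterative masked decoding: fill in masked video tokens over a few steps.
V, N, steps = 512, 64, 4               # vocab size, tokens to decode, decode steps
MASK = V                               # reserved mask id
tokens = torch.full((N,), MASK)

def toy_predictor(tokens):
    # Stand-in for the transformer: random logits over the token vocabulary.
    return torch.randn(tokens.shape[0], V)

for s in range(steps):
    logits = toy_predictor(tokens)
    conf, pred = logits.softmax(-1).max(-1)
    masked_idx = torch.where(tokens == MASK)[0]
    if masked_idx.numel() == 0:
        break
    n_unmask = max(1, masked_idx.numel() // (steps - s))   # spread over remaining steps
    order = conf[masked_idx].argsort(descending=True)
    chosen = masked_idx[order][:n_unmask]
    tokens[chosen] = pred[chosen]                          # keep most confident tokens

print(int((tokens == MASK).sum()), "tokens still masked")
```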
https://arxiv.org/abs/2502.04296
Temporal sequence modeling is a fundamental foundation for video prediction systems, real-time forecasting, and anomaly detection applications. Achieving accurate predictions with efficient resource consumption remains an open issue in contemporary temporal sequence modeling. We introduce the Multi-Attention Unit (MAUCell), which combines Generative Adversarial Networks (GANs) and spatio-temporal attention mechanisms to improve video frame prediction. Our approach implements three types of attention models to capture intricate motion sequences. A dynamic combination of these attention outputs allows the model to reach high decision accuracy and superior output quality while remaining computationally efficient. The integration of GAN elements makes generated frames appear more true to life, so the framework produces output sequences that mimic real-world footage. The design maintains an equilibrium between temporal continuity and spatial accuracy to deliver reliable video prediction. A comprehensive evaluation that merges the perceptual LPIPS measurement with the classic MSE, MAE, SSIM, and PSNR metrics shows improvements over contemporary approaches in direct benchmark tests on the Moving MNIST, KTH Action, and CASIA-B (Preprocessed) datasets. Our examination indicates that MAUCell shows promise for meeting operational time requirements. The findings demonstrate how GANs combined with attention mechanisms can yield better video sequence prediction.
https://arxiv.org/abs/2501.16997