Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from text or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames, balancing responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically routes heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
https://arxiv.org/abs/2512.08931
In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models continuously observe new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO's effectiveness.
https://arxiv.org/abs/2511.18255
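The core idea in the SAVi-DNO abstract above, optimizing the sampling noise while the model weights stay frozen, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a fixed linear map stands in for the frozen diffusion sampler, the squared error to an observed target stands in for the adaptation loss, and the name `refine_noise` is hypothetical. It is not the paper's procedure.

```python
import numpy as np

def refine_noise(frozen_weights, target, noise, steps=50):
    """Gradient descent on the input noise only; the weights are never updated."""
    w = frozen_weights
    # Step size 1/L, where L is the squared spectral norm of w, keeps the
    # quadratic loss from increasing between iterations.
    lr = 1.0 / np.linalg.norm(w, ord=2) ** 2
    losses = []
    for _ in range(steps):
        residual = w @ noise - target          # prediction error of the frozen model
        losses.append(0.5 * float(residual @ residual))
        noise = noise - lr * (w.T @ residual)  # update the noise, not w
    return noise, losses

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4))   # stand-in for the frozen model
weights_before = weights.copy()
target = rng.normal(size=4)         # stand-in for the observed future frame
initial_noise = rng.normal(size=4)

refined_noise, losses = refine_noise(weights, target, initial_noise)
print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.4f}")
```

In a real diffusion model the gradient would have to be propagated through the sampling steps, which is far more expensive than this linear case; the sketch only shows which quantity is treated as the optimization variable.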
Egocentric video generation with fine-grained control through body motion is a key requirement towards embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.
https://arxiv.org/abs/2511.18173
Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.
https://arxiv.org/abs/2511.17839
Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point flow) and geometric structure (via depth maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify it to incorporate information from depth and point flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
https://arxiv.org/abs/2511.16484
Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks. We will release models and code at this https URL
https://arxiv.org/abs/2511.07732
Behavior cloning methods for robot learning suffer from poor generalization due to limited data support beyond expert demonstrations. Recent approaches leveraging video prediction models have shown promising results by learning rich spatiotemporal representations from large-scale datasets. However, these models learn action-agnostic dynamics that cannot distinguish between different control inputs, limiting their utility for precise manipulation tasks and requiring large pretraining datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization. Empirical validation demonstrates generalization performance superior to baseline methods on real-world robotic manipulation tasks, showing particular robustness in OOD scenarios including visual distractions and lighting variations.
https://arxiv.org/abs/2510.27114
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
https://arxiv.org/abs/2510.24261
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that operates on continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.
https://arxiv.org/abs/2510.20807
Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
https://arxiv.org/abs/2510.19716
Script event induction, which aims to predict the subsequent event based on the context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented in the form of videos rather than scripts, yet there is a lack of related research in the realm of vision. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, which consists of about 35K annotated videos and more than 178K event video clips, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is represented as a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well-suited for AVEP. We propose EventFormer, a node-graph hierarchical attention based video event prediction model, which can capture both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments using several SOTA video prediction models as well as LVLMs on AVEP, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
https://arxiv.org/abs/2510.21786
Predicting precipitation maps is a highly complex spatiotemporal modeling task, critical for mitigating the impacts of extreme weather events. Short-term precipitation forecasting, or nowcasting, requires models that are not only accurate but also computationally efficient for real-time applications. Current methods, such as token-based autoregressive models, often suffer from flawed inductive biases and slow inference, while diffusion models can be computationally intensive. To address these limitations, we introduce BlockGPT, a generative autoregressive transformer using a batched tokenization (Block) method that predicts full two-dimensional fields (frames) at each time step. Conceived as a model-agnostic paradigm for video prediction, BlockGPT factorizes space-time by using self-attention within each frame and causal attention across frames; in this work, we instantiate it for precipitation nowcasting. We evaluate BlockGPT on two precipitation datasets, viz. KNMI (Netherlands) and SEVIR (U.S.), comparing it to state-of-the-art baselines including token-based (NowcastingGPT) and diffusion-based (DiffCast+Phydnet) models. The results show that BlockGPT achieves superior accuracy, event localization as measured by categorical metrics, and inference speeds up to 31x faster than comparable baselines.
https://arxiv.org/abs/2510.06293
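The space-time factorization described in the BlockGPT abstract above (full self-attention within a frame, causal attention across frames) can be pictured as a single block-causal mask over a frame-major token sequence. This is only a sketch under assumptions: the function name `block_causal_mask` is hypothetical, and the paper may well implement the factorization as two separate attention operations rather than one joint mask.

```python
import numpy as np

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask; entry [i, j] is True if query token i may attend to key token j."""
    # Frame index of every token in the flattened, frame-major sequence.
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A token sees every token of its own frame and of all earlier frames.
    return frame_id[:, None] >= frame_id[None, :]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
```

Unlike a token-level causal mask, tokens inside the same frame can see each other regardless of their position, so a whole frame can be predicted in one step.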
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
https://arxiv.org/abs/2509.15536
We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles -- akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.
https://arxiv.org/abs/2509.09737
In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions and for enabling robotic planning. However, existing paradigms fall short of jointly modeling these aspects. Vision-Language-Action (VLA) models focus on action prediction but lack explicit modeling of how actions influence the visual scene, while video prediction models generate future frames without conditioning on specific actions, often resulting in implausible or contextually inconsistent outcomes. To bridge this gap, we propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios, conditioned on hand trajectories. In the first stage, we perform consecutive state modeling to process heterogeneous inputs (visual observations, language, and action history) and explicitly predict future hand trajectories. In the second stage, we introduce causal cross-attention to fuse multi-modal cues, leveraging inferred action signals to guide an image-based Latent Diffusion Model (LDM) for frame-by-frame future video generation. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks, providing explicit predictions of both upcoming actions and their visual consequences. Extensive experiments on Ego4D, BridgeData, and RLBench demonstrate that our method outperforms state-of-the-art baselines in both action prediction and future video synthesis.
https://arxiv.org/abs/2508.19852
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.
https://arxiv.org/abs/2508.09822
Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models for complex dynamical settings; however, the resulting model, ViPro, assumed a given ground truth initial symbolic state. We show that this approach led to the model learning a shortcut that does not actually connect the observed environment with the predicted symbolic state, resulting in the inability to estimate states given an observation if previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without providing a full ground truth state in the beginning. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.
https://arxiv.org/abs/2508.06335
We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
https://arxiv.org/abs/2507.19468
Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models, which require multiple input frames and heavy computation and suffer from high inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
https://arxiv.org/abs/2507.14809
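PSNR, one of the two metrics reported in the abstract above, is defined as 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the predicted and reference frames. A minimal sketch follows; the function name `psnr` and the 8-bit default for `max_value` are assumptions for illustration, not details taken from the paper's evaluation code.

```python
import numpy as np

def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between a predicted and a reference frame."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        # Identical frames: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((4, 4))
prediction = np.full((4, 4), 25.5)  # uniform error of one tenth of the pixel range
print(psnr(prediction, reference))
```

Higher values mean the prediction is closer to the reference in a pixel-wise sense; PSNR says nothing about perceptual or structural similarity, which is why it is usually reported alongside SSIM.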
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments. Our models and code are publicly available at this https URL.
https://arxiv.org/abs/2507.11296
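The unidirectional attention described in the abstract above (video tokens conditioned on action tokens, action tokens independent of video tokens) explains why video prediction can be dropped at inference. The sketch below checks this on a toy single-head attention with identity projections. The token layout (action tokens first), the function names, and the attention itself are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def unidirectional_mask(num_action: int, num_video: int) -> np.ndarray:
    """Entry [i, j] is True if query i may attend to key j; action tokens come first."""
    n = num_action + num_video
    mask = np.ones((n, n), dtype=bool)
    # Action queries must not see video keys; video queries see everything.
    mask[:num_action, num_action:] = False
    return mask

def masked_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Toy single-head attention with identity projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

num_action, num_video = 2, 3
rng = np.random.default_rng(0)
tokens = rng.normal(size=(num_action + num_video, 4))

full = masked_attention(tokens, unidirectional_mask(num_action, num_video))
action_only = masked_attention(tokens[:num_action],
                               np.ones((num_action, num_action), dtype=bool))
print(np.allclose(full[:num_action], action_only))
```

Because the action rows never place weight on video keys, their outputs are identical whether or not the video tokens are present, so the video branch can be skipped when only actions are needed.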