Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks. We will release models and code at this https URL.
https://arxiv.org/abs/2511.07732
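To make the chunked flow matching decoder concrete, here is a minimal sketch of how a latent action could be decoded into a chunk of continuous robot actions with conditional flow matching. The network sizes, the 32-dimensional latent, the chunk length of 16, and the Euler sampler are illustrative assumptions, not ViPRA's released implementation.

```python
import torch
import torch.nn as nn

class ChunkedFlowMatchingDecoder(nn.Module):
    """Maps a latent action to a chunk of continuous robot actions via flow matching.

    Illustrative sizes only; the real ViPRA decoder may differ.
    """

    def __init__(self, latent_dim=32, action_dim=7, chunk_len=16, hidden=256):
        super().__init__()
        self.action_dim, self.chunk_len = action_dim, chunk_len
        in_dim = chunk_len * action_dim + latent_dim + 1  # noisy chunk + latent action + time
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, chunk_len * action_dim),
        )

    def velocity(self, x_t, t, z):
        # x_t: (B, chunk_len*action_dim), t: (B, 1), z: (B, latent_dim)
        return self.net(torch.cat([x_t, z, t], dim=-1))

    def loss(self, actions, z):
        # actions: (B, chunk_len, action_dim) ground-truth chunk from the few teleop demos
        x1 = actions.flatten(1)
        x0 = torch.randn_like(x1)
        t = torch.rand(x1.size(0), 1)
        x_t = (1 - t) * x0 + t * x1             # linear interpolation path
        target = x1 - x0                        # constant velocity along the path
        return nn.functional.mse_loss(self.velocity(x_t, t, z), target)

    @torch.no_grad()
    def sample(self, z, steps=10):
        # Euler integration from noise to an action chunk, conditioned on latent action z.
        x = torch.randn(z.size(0), self.chunk_len * self.action_dim)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((z.size(0), 1), i * dt)
            x = x + dt * self.velocity(x, t, z)
        return x.view(-1, self.chunk_len, self.action_dim)

decoder = ChunkedFlowMatchingDecoder()
z = torch.randn(4, 32)                          # latent actions from the pretrained video model
print(decoder.loss(torch.randn(4, 16, 7), z).item())
print(decoder.sample(z).shape)                  # torch.Size([4, 16, 7])
```

Because the decoder emits an entire chunk per call, a short chunk at a modest model size is one plausible way such a pipeline could reach the 20+ Hz control rates the abstract mentions.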
Behavior cloning methods for robot learning suffer from poor generalization due to limited data support beyond expert demonstrations. Recent approaches leveraging video prediction models have shown promising results by learning rich spatiotemporal representations from large-scale datasets. However, these models learn action-agnostic dynamics that cannot distinguish between different control inputs, limiting their utility for precise manipulation tasks and requiring large pretraining datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization. Empirical validation demonstrates generalization performance superior to baseline methods on real-world robotic manipulation tasks, showing particular robustness in out-of-distribution (OOD) scenarios including visual distractions and lighting variations.
https://arxiv.org/abs/2510.27114
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
https://arxiv.org/abs/2510.24261
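As a point of reference for the triplane representation DynaRend builds on, the sketch below shows the standard way a 3D query point is featurized from three axis-aligned feature planes (project onto XY/XZ/YZ, bilinearly sample, and fuse). The resolution, channel count, and sum-fusion are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, pts):
    """Sample triplane features at 3D points.

    planes: dict with 'xy', 'xz', 'yz' tensors of shape (B, C, R, R)
    pts:    (B, N, 3) points with coordinates normalized to [-1, 1]
    returns (B, N, C) fused features (summed over the three planes)
    """
    coords = {
        "xy": pts[..., [0, 1]],
        "xz": pts[..., [0, 2]],
        "yz": pts[..., [1, 2]],
    }
    feats = 0
    for name, plane in planes.items():
        grid = coords[name].unsqueeze(2)                      # (B, N, 1, 2)
        sampled = F.grid_sample(plane, grid, mode="bilinear",
                                align_corners=True)           # (B, C, N, 1)
        feats = feats + sampled.squeeze(-1).permute(0, 2, 1)  # (B, N, C)
    return feats

B, C, R = 2, 32, 64
planes = {k: torch.randn(B, C, R, R) for k in ("xy", "xz", "yz")}
pts = torch.rand(B, 1024, 3) * 2 - 1
print(query_triplane(planes, pts).shape)  # torch.Size([2, 1024, 32])
```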
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that operates on continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful for accurate estimation of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.
https://arxiv.org/abs/2510.20807
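The sketch below is a minimal illustration of the kind of pure-transformer autoregressive predictor over continuous pixel-space representations described above, using a deliberately coarse one-token-per-frame layout; the paper itself compares finer spatiotemporal attention layouts, and all sizes here are made-up assumptions.

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Causal transformer that regresses the next frame from past frames.

    Each frame is treated as one token built from continuous pixels
    (a coarse layout; the paper studies finer spatiotemporal ones).
    """

    def __init__(self, frame_dim=3 * 64 * 64, d_model=512, n_layers=4, n_heads=8, max_t=32):
        super().__init__()
        self.embed = nn.Linear(frame_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_t, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, frame_dim)

    def forward(self, frames):
        # frames: (B, T, C*H*W) flattened continuous pixels
        B, T, _ = frames.shape
        x = self.embed(frames) + self.pos[:, :T]
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.encoder(x, mask=mask)
        return self.head(h)                      # position t predicts frame t+1

model = FramePredictor()
video = torch.randn(2, 8, 3 * 64 * 64)
pred = model(video)
loss = nn.functional.mse_loss(pred[:, :-1], video[:, 1:])  # next-frame regression
print(loss.item())
```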
Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
https://arxiv.org/abs/2510.19716
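A contraction penalty in the spirit of the Lyapunov-based stability regularizer could look like the sketch below, which penalizes any increase of a Lyapunov candidate V along a latent rollout. The quadratic V, the margin, and the shapes are placeholders, not LyTimeT's actual formulation.

```python
import torch
import torch.nn as nn

def lyapunov_regularizer(z_traj, V, margin=0.0):
    """Penalize increases of a Lyapunov candidate V along a latent rollout.

    z_traj: (B, T, D) latent states produced by the learned transition model
    V:      module mapping (B, D) -> (B, 1), a learned (or fixed) energy
    Encourages V(z_{t+1}) <= V(z_t) - margin, i.e. contraction during roll-outs.
    """
    v = V(z_traj.flatten(0, 1)).view(z_traj.size(0), z_traj.size(1))
    increase = v[:, 1:] - v[:, :-1] + margin
    return torch.relu(increase).mean()

# Placeholder energy: squared norm of the latent state.
class QuadraticV(nn.Module):
    def forward(self, z):
        return (z ** 2).sum(dim=-1, keepdim=True)

z_traj = torch.randn(4, 10, 16, requires_grad=True)
reg = lyapunov_regularizer(z_traj, QuadraticV(), margin=0.01)
reg.backward()
print(reg.item())
```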
Script event induction, which aims to predict the subsequent event based on the context, is a challenging NLP task that has achieved remarkable success in practical applications. However, human events are mostly recorded and presented in the form of videos rather than scripts, yet there is a lack of related research in the realm of vision. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, consisting of about 35K annotated videos and more than 178K event video clips, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well-suited for AVEP. We propose EventFormer, a node-graph hierarchical attention based video event prediction model that can capture both the relationships between events and their arguments and the coreferential relationships between arguments. We conducted experiments using several SOTA video prediction models as well as LVLMs on AVEP, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all of these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
https://arxiv.org/abs/2510.21786
Predicting precipitation maps is a highly complex spatiotemporal modeling task, critical for mitigating the impacts of extreme weather events. Short-term precipitation forecasting, or nowcasting, requires models that are not only accurate but also computationally efficient for real-time applications. Current methods, such as token-based autoregressive models, often suffer from flawed inductive biases and slow inference, while diffusion models can be computationally intensive. To address these limitations, we introduce BlockGPT, a generative autoregressive transformer that uses a batched tokenization (Block) method to predict full two-dimensional fields (frames) at each time step. Conceived as a model-agnostic paradigm for video prediction, BlockGPT factorizes space-time by using self-attention within each frame and causal attention across frames; in this work, we instantiate it for precipitation nowcasting. We evaluate BlockGPT on two precipitation datasets, viz. KNMI (Netherlands) and SEVIR (U.S.), comparing it to state-of-the-art baselines including token-based (NowcastingGPT) and diffusion-based (DiffCast+Phydnet) models. The results show that BlockGPT achieves superior accuracy and event localization as measured by categorical metrics, with inference up to 31x faster than comparable baselines.
https://arxiv.org/abs/2510.06293
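The space-time factorization BlockGPT describes (full self-attention within a frame, causal attention across frames) corresponds to a block-causal attention mask; a minimal construction is sketched below, with token counts chosen only for illustration.

```python
import torch

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean attention mask for block-causal space-time factorization.

    Tokens within the same frame attend to each other fully; across frames,
    a token may only attend to tokens from the same or earlier frames.
    True marks blocked positions, matching torch.nn.MultiheadAttention's
    boolean attn_mask convention.
    """
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # blocked if the key's frame comes after the query's frame
    return frame_id.unsqueeze(1) < frame_id.unsqueeze(0)

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
```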
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4x faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
https://arxiv.org/abs/2509.15536
We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles -- akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.
https://arxiv.org/abs/2509.09737
In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions and for enabling robotic planning. However, existing paradigms fall short of jointly modeling these aspects. Vision-Language-Action (VLA) models focus on action prediction but lack explicit modeling of how actions influence the visual scene, while video prediction models generate future frames without conditioning on specific actions, often resulting in implausible or contextually inconsistent outcomes. To bridge this gap, we propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios, conditioned on hand trajectories. In the first stage, we perform consecutive state modeling to process heterogeneous inputs (visual observations, language, and action history) and explicitly predict future hand trajectories. In the second stage, we introduce causal cross-attention to fuse multi-modal cues, leveraging inferred action signals to guide an image-based Latent Diffusion Model (LDM) for frame-by-frame future video generation. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks, providing explicit predictions of both upcoming actions and their visual consequences. Extensive experiments on Ego4D, BridgeData, and RLBench demonstrate that our method outperforms state-of-the-art baselines in both action prediction and future video synthesis.
https://arxiv.org/abs/2508.19852
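A hedged sketch of the second-stage fusion step described above: frame latents act as queries that cross-attend to the predicted hand-trajectory tokens under a causal constraint, so a frame can only see trajectory cues up to its own time step. The module, dimensions, and step bookkeeping are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TrajectoryCrossAttention(nn.Module):
    """Frame latents (queries) attend to predicted hand-trajectory tokens (keys/values)."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frame_tokens, traj_tokens, traj_step_ids, frame_step_ids):
        # Causal constraint: a frame token at step t may only look at trajectory
        # tokens predicted for steps <= t (True = blocked in the attn_mask).
        blocked = traj_step_ids.unsqueeze(0) > frame_step_ids.unsqueeze(1)  # (Lq, Lk)
        fused, _ = self.attn(frame_tokens, traj_tokens, traj_tokens, attn_mask=blocked)
        return self.norm(frame_tokens + fused)

fuse = TrajectoryCrossAttention()
frames = torch.randn(2, 16, 256)          # 16 frame latents
traj = torch.randn(2, 8, 256)             # 8 predicted trajectory tokens
out = fuse(frames, traj,
           traj_step_ids=torch.arange(8),
           frame_step_ids=torch.arange(16) // 2)   # two frame tokens per step
print(out.shape)  # torch.Size([2, 16, 256])
```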
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.
https://arxiv.org/abs/2508.09822
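As a rough illustration of PAR's physical tokens, the toy model below interleaves continuous frame and action embeddings into one sequence and processes them with a causal transformer, so frame positions predict the next action token and action positions predict the next frame token. The real model's DiT-based de-tokenizer, inverse-kinematics mask, and KV-cache are omitted, and all shapes are invented.

```python
import torch
import torch.nn as nn

class PhysicalAR(nn.Module):
    """Toy autoregressive model over interleaved frame/action embeddings."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, n_layers)
        self.to_action = nn.Linear(d_model, d_model)   # continuous action token head
        self.to_frame = nn.Linear(d_model, d_model)    # continuous frame token head

    def forward(self, frame_emb, action_emb):
        # Interleave: f_0, a_0, f_1, a_1, ...  -> (B, 2T, D)
        B, T, D = frame_emb.shape
        seq = torch.stack([frame_emb, action_emb], dim=2).reshape(B, 2 * T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(2 * T)
        h = self.core(seq, mask=mask)
        # Frame positions (even indices) predict the next action token,
        # action positions (odd indices) predict the next frame token.
        return self.to_action(h[:, 0::2]), self.to_frame(h[:, 1::2])

model = PhysicalAR()
frames, actions = torch.randn(2, 6, 256), torch.randn(2, 6, 256)
next_actions, next_frames = model(frames, actions)
print(next_actions.shape, next_frames.shape)  # (2, 6, 256) each
```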
Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models for complex dynamical settings; however, their model ViPro assumed a given ground-truth initial symbolic state. We show that this approach led the model to learn a shortcut that does not actually connect the observed environment with the predicted symbolic state, resulting in an inability to estimate states from an observation when previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without being given a full ground-truth state at the beginning. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.
https://arxiv.org/abs/2508.06335
We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.
https://arxiv.org/abs/2507.19468
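A minimal sketch of how an action-conditioned latent world model of this kind can be used for planning: sample candidate action sequences, roll each out with the latent predictor, and keep the sequence whose final latent lands closest to a goal latent. The random-shooting scheme, the goal metric, and the toy linear predictor below are assumptions, not what the DINO-world paper prescribes.

```python
import torch

def plan_by_shooting(predict, z0, z_goal, num_candidates=256, horizon=8, action_dim=7):
    """Random-shooting planner in latent space.

    predict(z, a) -> next latent, applied step by step for each candidate sequence.
    Returns the first action of the best-scoring sequence.
    """
    actions = torch.randn(num_candidates, horizon, action_dim)
    z = z0.expand(num_candidates, -1).clone()
    for t in range(horizon):
        z = predict(z, actions[:, t])
    scores = -((z - z_goal) ** 2).sum(dim=-1)       # closer to goal latent = better
    best = scores.argmax()
    return actions[best, 0]

# Placeholder latent dynamics: a random linear map (stands in for the fine-tuned predictor).
D, A = 64, 7
W_z, W_a = torch.randn(D, D) * 0.1, torch.randn(A, D) * 0.1
predict = lambda z, a: torch.tanh(z @ W_z + a @ W_a)

a0 = plan_by_shooting(predict, z0=torch.randn(1, D), z_goal=torch.randn(1, D))
print(a0.shape)  # torch.Size([7])
```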
Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
https://arxiv.org/abs/2507.14809
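For orientation, the snippet below shows the stock InstructPix2Pix interface in the diffusers library (public timbrooks/instruct-pix2pix checkpoint): a single image plus a text instruction in, a single predicted image out. The paper fine-tunes this kind of model on robot data to predict a frame roughly 10 seconds ahead; the checkpoint, file names, and prompt here are placeholders, not the authors' weights or data.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Load the public InstructPix2Pix checkpoint (the paper fine-tunes this kind of model
# on robot data; here we only illustrate the image+text -> image interface).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

current_view = Image.open("current_robot_view.png").convert("RGB")   # hypothetical file
instruction = "pick up the red block and place it in the tray"       # hypothetical prompt

# One forward pass predicts a single future frame instead of rolling out many
# intermediate frames as a conventional video prediction model would.
future_view = pipe(
    instruction,
    image=current_view,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
future_view.save("predicted_future_view.png")
```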
Bimanual manipulation is crucial in robotics, enabling complex tasks in industrial automation and household services. However, it poses significant challenges due to the high-dimensional action space and intricate coordination requirements. While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. Specifically, we propose a multi-frame latent prediction strategy that encodes future states in a compressed latent space, preserving task-relevant features. Furthermore, we introduce a unidirectional attention mechanism where video prediction is conditioned on the action, while action prediction remains independent of video prediction. This design allows us to omit video prediction during inference, significantly enhancing efficiency. Experiments on two simulated benchmarks and a real-world setting demonstrate a significant improvement in the success rate over the strong baseline ACT using our method, achieving a 24.9% increase on ALOHA, an 11.1% increase on RoboTwin, and a 32.5% increase in real-world experiments. Our models and code are publicly available at this https URL.
https://arxiv.org/abs/2507.11296
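The unidirectional attention described in the abstract above can be expressed as a simple attention mask over grouped tokens, sketched below under the assumption that the sequence is laid out as [action tokens | video tokens]: action tokens never see video tokens, so the video branch can be dropped at inference without changing the action output, while video tokens may still attend to action tokens.

```python
import torch

def unidirectional_mask(n_action, n_video):
    """Boolean mask (True = blocked) over [action tokens | video tokens].

    Action tokens never attend to video tokens, so dropping the video branch at
    inference does not change the action predictions. Video tokens may attend
    to action tokens, letting video prediction be conditioned on the action.
    """
    n = n_action + n_video
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_action, n_action:] = True   # block action -> video attention
    return mask

print(unidirectional_mask(n_action=2, n_video=3).int())
# [[0, 0, 1, 1, 1],
#  [0, 0, 1, 1, 1],
#  [0, 0, 0, 0, 0],
#  [0, 0, 0, 0, 0],
#  [0, 0, 0, 0, 0]]
```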
Spatio-temporal video prediction plays a pivotal role in critical domains, ranging from weather forecasting to industrial automation. However, in high-precision industrial scenarios such as semiconductor manufacturing, the absence of specialized benchmark datasets severely hampers research on modeling and predicting complex processes. To address this challenge, we make a twofold contribution. First, we construct and release the Chip Dicing Lane Dataset (CHDL), the first public temporal image dataset dedicated to the semiconductor wafer dicing process. Captured via an industrial-grade vision system, CHDL provides a much-needed and challenging benchmark for high-fidelity process modeling, defect detection, and digital twins. Second, we propose DIFFUMA, an innovative dual-path prediction architecture specifically designed for such fine-grained dynamics. The model captures global long-range temporal context through a parallel Mamba module, while simultaneously leveraging a diffusion module, guided by temporal features, to restore and enhance fine-grained spatial details, effectively combating feature degradation. Experiments demonstrate that on our CHDL benchmark, DIFFUMA significantly outperforms existing methods, reducing the Mean Squared Error (MSE) by 39% and improving the Structural Similarity (SSIM) from 0.926 to a near-perfect 0.988. This superior performance also generalizes to natural phenomena datasets. Our work not only delivers a new state-of-the-art (SOTA) model but, more importantly, provides the community with an invaluable data resource to drive future research in industrial AI.
https://arxiv.org/abs/2507.06738
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at this https URL.
https://arxiv.org/abs/2506.24113
We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
https://arxiv.org/abs/2506.21552
Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics by integrating simulation, prediction, and manipulation. However, their practical application remains limited due to (1) slow generation speed, which limits real-time interaction, and (2) poor consistency between imagined videos and executable actions. To address these challenges, we propose Manipulate in Dream (MinD), a hierarchical diffusion-based world model framework that employs a dual-system design for vision-language manipulation. MinD executes the VGM at low frequencies to extract video prediction features, while leveraging a high-frequency diffusion policy for real-time interaction. This architecture enables low-latency, closed-loop control in manipulation with coherent visual guidance. To better coordinate the two systems, we introduce a video-action diffusion matching module (DiffMatcher), with a novel co-training strategy that uses separate schedulers for each diffusion model. Specifically, we introduce a diffusion-forcing mechanism to DiffMatcher that aligns their intermediate representations during training, helping the fast action model better understand video-based predictions. Beyond manipulation, MinD also functions as a world simulator, reliably predicting task success or failure in latent space before execution. Trustworthy analysis further shows that VGMs can preemptively evaluate task feasibility and mitigate risks. Extensive experiments across multiple benchmarks demonstrate that MinD achieves state-of-the-art manipulation performance (63%+) on RL-Bench, advancing the frontier of unified world modeling in robotics.
https://arxiv.org/abs/2506.18897
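A minimal sketch of the dual-system idea from MinD's abstract: an expensive video-prediction call refreshes guidance features only every K control steps, while a cheap policy acts at every step against the latest guidance. The modules, rates, and environment below are placeholders.

```python
import torch

def dual_frequency_control(env_step, slow_world_model, fast_policy,
                           obs, total_steps=100, slow_every=10):
    """Run a fast policy at every step, refreshing slow world-model features
    only every `slow_every` steps (low-frequency video prediction features
    guiding a high-frequency action policy)."""
    guidance = slow_world_model(obs)                     # expensive call
    for t in range(total_steps):
        if t > 0 and t % slow_every == 0:
            guidance = slow_world_model(obs)             # refresh visual guidance
        action = fast_policy(obs, guidance)              # cheap call, every step
        obs = env_step(action)
    return obs

# Placeholder modules: random features, a zero policy, and a toy environment.
slow_world_model = lambda obs: torch.randn(32)
fast_policy = lambda obs, guidance: torch.zeros(7)
env_step = lambda action: torch.randn(64)

final_obs = dual_frequency_control(env_step, slow_world_model, fast_policy, obs=torch.randn(64))
print(final_obs.shape)
```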
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at this https URL.
https://arxiv.org/abs/2506.14198
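A hedged sketch of the forward/inverse split AMPLIFY describes: a forward model trained on action-free video predicts discrete motion tokens (standing in for quantized keypoint trajectories), and an inverse model trained on a small action-labeled set maps those tokens plus the current observation to robot actions. The architectures, vocabulary size, and shapes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ForwardDynamics(nn.Module):
    """Predicts discrete motion tokens (quantized keypoint trajectories) from frames.
    Trainable on action-free video alone."""
    def __init__(self, obs_dim=512, vocab=256, horizon=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(),
                                 nn.Linear(512, horizon * vocab))
        self.horizon, self.vocab = horizon, vocab

    def forward(self, obs_feat):
        return self.net(obs_feat).view(-1, self.horizon, self.vocab)   # token logits

class InverseDynamics(nn.Module):
    """Maps predicted motion tokens plus the current observation to robot actions.
    Trainable on a small action-labeled set."""
    def __init__(self, obs_dim=512, vocab=256, horizon=8, action_dim=7):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, 64)
        self.net = nn.Sequential(nn.Linear(obs_dim + horizon * 64, 512), nn.ReLU(),
                                 nn.Linear(512, horizon * action_dim))
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, obs_feat, motion_tokens):
        emb = self.token_emb(motion_tokens).flatten(1)
        return self.net(torch.cat([obs_feat, emb], dim=-1)).view(-1, self.horizon, self.action_dim)

fwd, inv = ForwardDynamics(), InverseDynamics()
obs = torch.randn(2, 512)
motion_tokens = fwd(obs).argmax(dim=-1)             # (2, 8) discrete motion tokens
actions = inv(obs, motion_tokens)
print(actions.shape)                                # torch.Size([2, 8, 7])
```

Because only the inverse model needs action labels, the two parts can be scaled independently, which is the decoupling the abstract emphasizes.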