Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Code and models will be released.
https://arxiv.org/abs/2602.10717
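The second stage of the pipeline, adversarial distillation, pairs a reconstruction target from the many-step teacher with a discriminator term. A minimal numpy sketch of such a combined objective, assuming a generic non-saturating GAN loss and a toy discriminator (the paper's exact losses are not given here):

```python
import numpy as np

def distillation_losses(student_out, teacher_out, disc):
    """Objective for adversarially distilling a few-step student from a
    many-step teacher: match the teacher's output, and fool a
    discriminator (non-saturating GAN loss)."""
    recon = np.mean((student_out - teacher_out) ** 2)  # distillation term
    adv = -np.log(disc(student_out) + 1e-8)            # adversarial term
    return recon, adv

# Toy stand-ins for generated frames and a discriminator.
disc = lambda x: 1.0 / (1.0 + np.exp(-x.mean()))  # sigmoid "realness" score
student = np.full((4, 4), 0.5)
teacher = np.full((4, 4), 1.0)
recon, adv = distillation_losses(student, teacher, disc)
```

In practice the two terms are weighted and summed; the weighting is a training hyperparameter not specified in the abstract.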
Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, achieving two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF's strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.
https://arxiv.org/abs/2602.00148
We present Akasha 2, a state-of-the-art multimodal architecture that integrates Hamiltonian State Space Duality (H-SSD) with Visual-Language Joint Embedding Predictive Architecture (VL-JEPA). The system leverages the Mamba-3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE-HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra-low latency (<50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics-inspired inductive biases into neural architectures yields significant improvements: state-of-the-art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3-18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.
https://arxiv.org/abs/2601.06212
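The symplectic integration the abstract credits for enforcing conservation laws can be illustrated with the classic leapfrog scheme. A self-contained numpy sketch on a harmonic oscillator (a toy system, not the paper's latent model), showing that total energy stays within a bounded band over a long rollout:

```python
import numpy as np

def leapfrog(q, p, grad_V, dt, steps):
    """Symplectic (leapfrog) integration of Hamiltonian dynamics with
    unit mass: half kick, drift, half kick. Unlike explicit Euler, the
    energy error stays bounded instead of drifting over long horizons."""
    traj = []
    for _ in range(steps):
        p -= 0.5 * dt * grad_V(q)  # half kick
        q += dt * p                # drift
        p -= 0.5 * dt * grad_V(q)  # half kick
        traj.append((q, p))
    return traj

# Harmonic oscillator: V(q) = q^2/2, so H = p^2/2 + q^2/2 and grad V = q.
traj = leapfrog(q=1.0, p=0.0, grad_V=lambda q: q, dt=0.1, steps=1000)
energies = [0.5 * p * p + 0.5 * q * q for q, p in traj]
drift = max(energies) - min(energies)  # stays small over the whole rollout
```

The bounded `drift` is the property the abstract refers to as "maintaining energy conservation over extended horizons"; an explicit Euler integrator on the same system would gain energy monotonically.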
Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models' ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support scalable training, we introduce a data synthesis pipeline capable of transforming arbitrary video corpora into MVP training samples, and further employ Group Relative Policy Optimization (GRPO) with a fine-grained reward function to enhance the model's understanding of video context and temporal properties. Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
https://arxiv.org/abs/2601.03781
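The MVP objective amounts to hiding a continuous segment and asking the model to pick it out from distractors. A hypothetical data-synthesis sketch in plain Python (field names and the distractor-sampling scheme are illustrative; the paper's pipeline draws harder distractors at corpus scale):

```python
import random

def make_mvp_sample(frames, seg_len, num_distractors, rng):
    """Build one Masked Video Prediction sample: hide a continuous
    segment and mix it with distractor segments of the same length
    (here drawn from the same video for simplicity)."""
    start = rng.randrange(0, len(frames) - seg_len + 1)
    target = frames[start:start + seg_len]
    context = frames[:start] + ["<mask>"] * seg_len + frames[start + seg_len:]
    distractors = []
    while len(distractors) < num_distractors:
        s = rng.randrange(0, len(frames) - seg_len + 1)
        candidate = frames[s:s + seg_len]
        if candidate != target:
            distractors.append(candidate)
    options = distractors + [target]
    rng.shuffle(options)
    return {"context": context, "options": options,
            "answer": options.index(target)}

rng = random.Random(0)
frames = [f"f{i}" for i in range(12)]
sample = make_mvp_sample(frames, seg_len=3, num_distractors=3, rng=rng)
```

A GRPO-style reward would then score the model's choice of option against `sample["answer"]`.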
Prevalent Vision-Language-Action (VLA) models are typically built upon Multimodal Large Language Models (MLLMs) and demonstrate exceptional proficiency in semantic understanding, but they inherently lack the capability to deduce physical world dynamics. Consequently, recent approaches have shifted toward World Models, typically formulated via video prediction; however, these methods often suffer from a lack of semantic grounding and exhibit brittleness when handling prediction errors. To synergize semantic understanding with dynamic predictive capabilities, we present InternVLA-A1. This model employs a unified Mixture-of-Transformers architecture, coordinating three experts for scene understanding, visual foresight generation, and action execution. These components interact seamlessly through a unified masked self-attention mechanism. Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. We pre-train these models on hybrid synthetic-real datasets spanning InternData-A1 and Agibot-World, covering over 533M frames. This hybrid training strategy effectively harnesses the diversity of synthetic simulation data while minimizing the sim-to-real gap. We evaluated InternVLA-A1 across 12 real-world robotic tasks and simulation benchmarks. It significantly outperforms leading models like pi0 and GR00T N1.5, achieving a 14.5% improvement in daily tasks and a 40%-73.3% boost in dynamic settings, such as conveyor belt sorting.
https://arxiv.org/abs/2601.02456
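The unified masked self-attention coordinating the three experts can be pictured as a block attention mask over one shared token sequence. A numpy sketch under the assumption that each expert attends to itself and to earlier experts (the actual routing in InternVLA-A1 may differ):

```python
import numpy as np

def expert_attention_mask(n_scene, n_video, n_action):
    """Block attention mask for three experts sharing one token sequence
    (scene understanding -> visual foresight -> action execution): each
    block attends to itself and to all earlier blocks. True = may attend."""
    sizes = [n_scene, n_video, n_action]
    starts = np.cumsum([0] + sizes)
    mask = np.zeros((starts[-1], starts[-1]), dtype=bool)
    for b in range(3):          # query block
        for c in range(b + 1):  # key blocks: itself and earlier experts
            mask[starts[b]:starts[b + 1], starts[c]:starts[c + 1]] = True
    return mask

m = expert_attention_mask(n_scene=2, n_video=3, n_action=1)
```

Under this pattern the action expert sees both understanding and foresight tokens, while the scene expert never attends forward into action tokens.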
World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing the best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
https://arxiv.org/abs/2512.23421
Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models publicly available at: this https URL.
https://arxiv.org/abs/2512.22688
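Autoregressive flow matching generates each continuous token by integrating a learned velocity field conditioned on the history. A toy numpy sketch with Euler integration and a hand-written constant-extrapolation velocity field standing in for the learned model (the deterministic zero start replaces the noise sample, purely for illustration):

```python
import numpy as np

def flow_sample(history, velocity, n_steps=20):
    """One autoregressive step: Euler-integrate a velocity field from a
    fixed start point to produce the next continuous token, conditioned
    on the history of previous tokens."""
    x = np.zeros_like(history[-1])  # deterministic start (noise omitted)
    for i in range(n_steps):
        x = x + velocity(x, i / n_steps, history) / n_steps
    return x

def velocity(x, t, history):
    """Toy rectified flow toward 'last point + last displacement';
    v = target - start, and the start is the origin here."""
    return history[-1] + (history[-1] - history[-2])

track = [np.array([0.0, 0.0]), np.array([1.0, 0.5])]
for _ in range(3):  # roll out three future track points autoregressively
    track.append(flow_sample(track, velocity))
```

Each generated point re-enters the history, which is the autoregressive part; the learned part of ARFM is the velocity field that this constant extrapolation stands in for.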
Video prediction is plagued by a fundamental trilemma: achieving high-resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR's single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to $O(S + T)$ and memory complexity to $O(\max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond, operating directly on dense feature maps with a patch-free design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state-of-the-art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18%, paving the way for safer and more anticipatory embodied agents.
https://arxiv.org/abs/2512.21710
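EVA's factorization, alternating attention along the spatial and temporal axes instead of over flattened spacetime tokens, can be sketched in a few lines of numpy (single head, no learned projections; `eva_block` is an illustrative name, not RAPTOR's actual module):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    """Single-head self-attention over the second-to-last axis
    (queries = keys = values = x; projections omitted)."""
    d = x.shape[-1]
    w = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return w @ x

def eva_block(x):
    """Factorized spatiotemporal attention: attend along the spatial
    axis within each frame, then along the temporal axis at each
    location, never over the full S*T token set. x has shape (T, S, D)."""
    x = x + attend(x)              # spatial mixing, per frame
    xt = np.swapaxes(x, 0, 1)      # (S, T, D)
    xt = xt + attend(xt)           # temporal mixing, per location
    return np.swapaxes(xt, 0, 1)

rng = np.random.default_rng(0)
y = eva_block(rng.standard_normal((4, 6, 8)))  # T=4, S=6, D=8
```

Each attention call sees at most S or T tokens at once, which is where the claimed memory saving over flattened $S \cdot T$ attention comes from.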
We present STORM (Search-Guided Generative World Models), a novel framework for spatio-temporal reasoning in robotic manipulation that unifies diffusion-based action generation, conditional video prediction, and search-based planning. Unlike prior Vision-Language-Action (VLA) models that rely on abstract latent dynamics or delegate reasoning to language components, STORM grounds planning in explicit visual rollouts, enabling interpretable and foresight-driven decision-making. A diffusion-based VLA policy proposes diverse candidate actions, a generative video world model simulates their visual and reward outcomes, and Monte Carlo Tree Search (MCTS) selectively refines plans through lookahead evaluation. Experiments on the SimplerEnv manipulation benchmark demonstrate that STORM achieves a new state-of-the-art average success rate of 51.0 percent, outperforming strong baselines such as CogACT. Reward-augmented video prediction substantially improves spatio-temporal fidelity and task relevance, reducing Frechet Video Distance by over 75 percent. Moreover, STORM exhibits robust re-planning and failure recovery behavior, highlighting the advantages of search-guided generative world models for long-horizon robotic manipulation.
https://arxiv.org/abs/2512.18477
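STORM's propose-simulate-select loop can be reduced to a depth-one sketch: score candidate actions by the reward a world model imagines for them and keep the best. The full method replaces this greedy step with MCTS over video rollouts; everything below is a toy stand-in:

```python
def plan_with_rollouts(state, candidates, simulate):
    """Depth-one planning: score each candidate action by the reward a
    world model predicts for its imagined outcome, and pick the best."""
    scored = [(simulate(state, a)[1], a) for a in candidates]
    best_reward, best_action = max(scored)
    return best_action, best_reward

# Toy world model: move along a line toward goal position 5;
# reward is negative distance to the goal after the action.
simulate = lambda s, a: (s + a, -abs((s + a) - 5))
action, reward = plan_with_rollouts(state=3, candidates=[-1, 0, 1],
                                    simulate=simulate)
```

Re-planning falls out naturally: after executing the chosen action in the real environment, the loop runs again from the observed state, which is how failure recovery behavior arises.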
Robotic arm manipulation in data-scarce settings is a highly challenging task due to the complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring the temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for the embodiment-specific closed-loop control, typically suffering from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error correction capabilities across previously unseen robotic platforms.
https://arxiv.org/abs/2512.17661
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames, balancing responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
https://arxiv.org/abs/2512.08931
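The noise-augmented history memory can be sketched directly: perturb the cached past frames before they are used as conditioning, so the model cannot copy history verbatim. A minimal numpy version (the noise scale and schedule are assumptions, not Astra's settings):

```python
import numpy as np

def noise_augmented_history(frames, sigma=0.1, rng=None):
    """Corrupt cached past frames with Gaussian noise before they are
    used as conditioning, so the model cannot copy history verbatim
    and over-rely on it."""
    rng = rng or np.random.default_rng(0)
    return [f + sigma * rng.standard_normal(f.shape) for f in frames]

frames = [np.zeros((4, 4)) for _ in range(3)]
noisy = noise_augmented_history(frames)
```

In training, `sigma` trades off the two failure modes the abstract names: zero noise encourages copying (stale responsiveness), while heavy noise erodes temporal coherence.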
In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models observe continuously new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO's effectiveness.
https://arxiv.org/abs/2511.18255
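The core move of SAVi-DNO, refining the sampling noise at inference while the model parameters stay frozen, can be sketched with a toy frozen "decoder" and finite-difference gradient descent on the noise (the paper optimizes diffusion noise through a real model; everything here is a stand-in):

```python
import numpy as np

def optimize_noise(decode, z, target, lr=0.1, steps=200, eps=1e-4):
    """Refine the sampling noise z so that the frozen model's decoded
    prediction matches a newly observed frame; the model weights never
    change. Gradients via finite differences to stay framework-free."""
    for _ in range(steps):
        base = np.sum((decode(z) - target) ** 2)
        grad = np.zeros_like(z)
        for i in range(z.size):
            dz = np.zeros_like(z)
            dz[i] = eps
            grad[i] = (np.sum((decode(z + dz) - target) ** 2) - base) / eps
        z = z - lr * grad
    return z

# Frozen toy "denoiser": a fixed linear map standing in for the model.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
decode = lambda z: A @ z
target = np.array([1.0, 1.0])
z = optimize_noise(decode, np.zeros(2), target)
```

On a stream, each newly observed frame becomes the next `target`, so the noise adapts continuously while fine-tuning costs are avoided entirely.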
Egocentric video generation with fine-grained control through body motion is a key requirement towards embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.
https://arxiv.org/abs/2511.18173
Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.
https://arxiv.org/abs/2511.17839
Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
https://arxiv.org/abs/2511.16484
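The Wasserstein-on-masks idea rewards predictions whose object motion distribution matches the ground truth, even when per-pixel appearance differs. A small numpy sketch for the 1-D, equal-weight case, comparing x-centroid trajectories of two toy clips (the study's exact metric construction may differ):

```python
import numpy as np

def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two equal-size empirical samples:
    mean absolute difference of the sorted values."""
    u, v = np.sort(np.asarray(u, float)), np.sort(np.asarray(v, float))
    return float(np.mean(np.abs(u - v)))

def mask_centroids(masks):
    """x-centroid of a binary object mask per frame; masks is (T, H, W)."""
    return [float(np.argwhere(m)[:, 1].mean()) for m in masks]

def clip(speed):
    """Three frames of a one-pixel object moving right at `speed` px/frame."""
    frames = np.zeros((3, 1, 8), dtype=bool)
    for t in range(3):
        frames[t, 0, t * speed] = True
    return frames

# A prediction that moves the object at the wrong speed scores > 0
# even if every individual frame looks plausible.
d = wasserstein_1d(mask_centroids(clip(1)), mask_centroids(clip(2)))
```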
Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real-world manipulation tasks. We will release models and code at this https URL
https://arxiv.org/abs/2511.07732
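Chunked action decoding, where each latent action expands into a short sequence of low-level actions, is what lets a slow model drive high-frequency control. A toy sketch (the ramp decoder is an assumption standing in for ViPRA's flow matching decoder):

```python
def chunked_control(decode_chunk, latents, chunk_len):
    """Expand each latent action into a chunk of low-level actions, so a
    planner running once per chunk can still supply a high-frequency
    controller with a continuous action stream."""
    actions = []
    for z in latents:
        actions.extend(decode_chunk(z, chunk_len))
    return actions

# Toy decoder: ramp linearly from 0 toward the latent "target" value.
decode_chunk = lambda z, n: [z * (i + 1) / n for i in range(n)]
acts = chunked_control(decode_chunk, latents=[1.0, -1.0], chunk_len=4)
```

With the abstract's numbers, a model emitting latents at ~3 Hz and a chunk length of ~7 would yield the quoted 22 Hz control rate (the exact split is not stated and is inferred here).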
Behavior cloning methods for robot learning suffer from poor generalization due to limited data support beyond expert demonstrations. Recent approaches leveraging video prediction models have shown promising results by learning rich spatiotemporal representations from large-scale datasets. However, these models learn action-agnostic dynamics that cannot distinguish between different control inputs, limiting their utility for precise manipulation tasks and requiring large pretraining datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization. Empirical validation demonstrates generalization performance superior to baseline methods on real-world robotic manipulation tasks, showing particular robustness in OOD scenarios including visual distractions and lighting variations.
https://arxiv.org/abs/2510.27114
Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
https://arxiv.org/abs/2510.24261
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations for video prediction. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter efficient, and interpretable approach.
https://arxiv.org/abs/2510.20807
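The probing experiments fit a simple readout from frozen network activations to simulation parameters. A minimal least-squares linear probe in numpy (the features and the "viscosity" parameter below are synthetic stand-ins, not the paper's data):

```python
import numpy as np

def fit_linear_probe(features, params):
    """Least-squares linear probe (with bias) from frozen network
    features to scalar simulation parameters."""
    X = np.hstack([features, np.ones((len(features), 1))])
    W, *_ = np.linalg.lstsq(X, params, rcond=None)
    return W

def probe(W, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ W

# Synthetic check: "features" that linearly encode a hidden parameter.
rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 8))
visc = feats @ rng.standard_normal(8) + 0.3  # hidden parameter per sample
W = fit_linear_probe(feats, visc)
pred = probe(W, feats)
```

A probe that recovers the parameter accurately from one layer but not another is what localizes where the network encodes simulation-relevant information.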
Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
https://arxiv.org/abs/2510.19716
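The Lyapunov-based regularizer penalizes latent transitions that fail to contract, which is what limits error accumulation during roll-outs. A sketch with $V(z) = \lVert z \rVert^2$ and a hinge on the contraction factor (an assumed form; the paper's exact regularizer may differ):

```python
import numpy as np

def lyapunov_penalty(f, states, margin=0.99):
    """With V(z) = ||z||^2, penalize latent transitions z -> f(z) whose
    energy does not shrink by at least the factor `margin`, encouraging
    contraction of the learned dynamics."""
    terms = []
    for z in states:
        v_now = np.sum(z ** 2)
        v_next = np.sum(f(z) ** 2)
        terms.append(max(0.0, v_next - margin * v_now))  # hinge
    return sum(terms) / len(terms)

states = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
p_good = lyapunov_penalty(lambda z: 0.9 * z, states)  # contractive: no penalty
p_bad = lyapunov_penalty(lambda z: 1.1 * z, states)   # expansive: penalized
```

Added to the prediction loss, this term pushes the learned transition map toward dynamics whose roll-out errors shrink rather than compound.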