Video geolocalization aims to determine the precise GPS coordinates of a video's origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse, city-level granularity and fail to capture fine-grained details, while image-retrieval methods are impractical at a global scale because the extensive image galleries they require are infeasible to compile. By comparison, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on the Mapillary (MSLS) and GAMa datasets demonstrate our model's ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also surpass the current state of the art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: this https URL
https://arxiv.org/abs/2604.12159
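To make the frame-to-GPS retrieval idea above concrete, here is a minimal PyTorch sketch assuming a hypothetical random-Fourier-feature GPS encoder and precomputed frame embeddings; the names (GPSEncoder, retrieve_gps) and the gallery construction are illustrative, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPSEncoder(nn.Module):
    """Hypothetical GPS encoder: random Fourier features over (lat, lon) + MLP."""
    def __init__(self, dim=512, n_freq=64):
        super().__init__()
        self.register_buffer("freqs", torch.randn(2, n_freq) * 10.0)
        self.mlp = nn.Sequential(nn.Linear(2 * n_freq, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, coords):                 # coords: (N, 2) in degrees
        proj = coords @ self.freqs             # (N, n_freq)
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return F.normalize(self.mlp(feats), dim=-1)

def retrieve_gps(frame_emb, gps_gallery, gps_encoder, k=1):
    """Nearest-neighbour retrieval of GPS coordinates for each frame embedding."""
    gallery_emb = gps_encoder(gps_gallery)                 # (G, D)
    sims = F.normalize(frame_emb, dim=-1) @ gallery_emb.T  # (F, G) cosine similarity
    topk = sims.topk(k, dim=-1).indices                    # (F, k)
    return gps_gallery[topk]                               # (F, k, 2) predicted coordinates

# Toy usage: 8 frame embeddings against a 10k-point coordinate gallery.
enc = GPSEncoder()
gallery = torch.rand(10000, 2) * torch.tensor([180.0, 360.0]) - torch.tensor([90.0, 180.0])
frames = torch.randn(8, 512)
print(retrieve_gps(frames, gallery, enc).shape)  # torch.Size([8, 1, 2])
```

Because the gallery stores only coordinates rather than images, scaling it toward global coverage stays cheap, which is the core argument the abstract makes against image-retrieval galleries.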
Accurate future video prediction requires both high visual fidelity and consistent scene semantics, particularly in complex dynamic environments such as autonomous driving. We present Re2Pix, a hierarchical video prediction framework that decomposes forecasting into two stages: semantic representation prediction and representation-guided visual synthesis. Instead of directly predicting future RGB frames, our approach first forecasts future scene structure in the feature space of a frozen vision foundation model, and then conditions a latent diffusion model on these predicted representations to render photorealistic frames. This decomposition enables the model to focus first on scene dynamics and then on appearance generation. A key challenge arises from the train-test mismatch between ground-truth representations available during training and predicted ones used at inference. To address this, we introduce two conditioning strategies, nested dropout and mixed supervision, that improve robustness to imperfect autoregressive predictions. Experiments on challenging driving benchmarks demonstrate that the proposed semantics-first design significantly improves temporal semantic consistency, perceptual quality, and training efficiency compared to strong diffusion baselines. We provide the implementation code at this https URL
https://arxiv.org/abs/2604.11707
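A minimal sketch of the two conditioning strategies named above, under one plausible reading: nested dropout truncates a random suffix of representation channels, and mixed supervision swaps ground-truth representations for predicted ones per sample. Function names and the exact dropout scheme are assumptions, not the paper's implementation:

```python
import torch

def nested_dropout(rep, p_keep_all=0.5):
    """Randomly truncate the trailing channels of a conditioning representation.

    rep: (B, T, C) feature tokens. With probability 1 - p_keep_all we zero out a
    random suffix of channels, so the downstream diffusion model learns to cope
    with partially informative (imperfect) conditions.
    """
    B, T, C = rep.shape
    rep = rep.clone()
    for b in range(B):
        if torch.rand(()) > p_keep_all:
            keep = torch.randint(1, C + 1, ()).item()
            rep[b, :, keep:] = 0.0
    return rep

def mixed_supervision(rep_gt, rep_pred, p_pred=0.5):
    """Per-sample mix of ground-truth and autoregressively predicted representations."""
    use_pred = (torch.rand(rep_gt.shape[0]) < p_pred).view(-1, 1, 1)
    return torch.where(use_pred, rep_pred, rep_gt)

# Toy usage
gt, pred = torch.randn(4, 16, 256), torch.randn(4, 16, 256)
cond = nested_dropout(mixed_supervision(gt, pred))
print(cond.shape)  # torch.Size([4, 16, 256])
```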
Recent years have seen remarkable progress in autonomous driving, yet generalization to long-tail and open-world scenarios remains a major bottleneck for large-scale deployment. To address this challenge, some works use LLMs and VLMs for vision-language understanding and reasoning, enabling vehicles to interpret rare and safety-critical situations when generating actions. Others study generative world models to capture the spatio-temporal evolution of driving scenes, allowing agents to imagine possible futures before acting. Inspired by human intelligence, which unifies understanding and imagination, we explore a unified model for autonomous driving. We present LMGenDrive, the first framework that combines LLM-based multimodal understanding with generative world models for end-to-end closed-loop driving. Given multi-view camera inputs and natural-language instructions, LMGenDrive generates both future driving videos and control signals. This design provides complementary benefits: video prediction improves spatio-temporal scene modeling, while the LLM contributes strong semantic priors and instruction grounding from large-scale pretraining. We further propose a progressive three-stage training strategy, from vision pretraining to multi-step long-horizon driving, to improve stability and performance. LMGenDrive supports both low-latency online planning and autoregressive offline video generation. Experiments show that it significantly outperforms prior methods on challenging closed-loop benchmarks, with clear gains in instruction following, spatio-temporal understanding, and robustness to rare scenarios. These results suggest that unifying multimodal understanding and generation is a promising direction for more generalizable and robust embodied decision-making systems.
https://arxiv.org/abs/2604.08719
Video prediction is a useful capability for autonomous driving, enabling intelligent vehicles to reliably anticipate how driving scenes will evolve and thereby supporting reasoning and safer planning. However, existing models are constrained by multi-stage training pipelines and struggle to model the diverse motion patterns in real driving scenes, leading to degraded temporal consistency and visual quality. To address these challenges, this paper introduces the historical motion priors-informed diffusion model (HMPDM), a video prediction model that leverages historical motion priors to enhance motion understanding and temporal coherence. The proposed system introduces three key designs: (i) a Temporal-aware Latent Conditioning (TaLC) module for implicit historical motion injection; (ii) a Motion-aware Pyramid Encoder (MaPE) for multi-scale motion representation; (iii) a Self-Conditioning (SC) strategy for stable iterative denoising. Extensive experiments on the Cityscapes and KITTI benchmarks demonstrate that HMPDM efficiently outperforms state-of-the-art video prediction methods, achieving a 28.2% improvement in FVD on Cityscapes under the same monocular RGB input configuration. The implementation code is publicly available at this https URL.
https://arxiv.org/abs/2603.27371
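The Self-Conditioning (SC) strategy is the most self-contained of the three designs; below is a generic sketch of self-conditioned denoising (feed a detached first estimate of the clean latent back into a second pass), with a toy noising rule standing in for the actual diffusion schedule. Names and network sizes are illustrative:

```python
import torch
import torch.nn as nn

class SelfConditionedDenoiser(nn.Module):
    """Toy denoiser that optionally receives its own previous x0 estimate as input."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t, x0_prev=None):
        if x0_prev is None:
            x0_prev = torch.zeros_like(x_t)
        inp = torch.cat([x_t, x0_prev, t.expand(x_t.shape[0], 1)], dim=-1)
        return self.net(inp)  # predicted clean latent x0

def training_step(model, x0, t, noise):
    x_t = x0 + t * noise                     # simplified noising, for illustration only
    x0_prev = None
    if torch.rand(()) < 0.5:                 # self-conditioning: reuse a detached first estimate
        with torch.no_grad():
            x0_prev = model(x_t, t)
    x0_hat = model(x_t, t, x0_prev)
    return ((x0_hat - x0) ** 2).mean()

model = SelfConditionedDenoiser()
loss = training_step(model, torch.randn(8, 64), torch.tensor([[0.3]]), torch.randn(8, 64))
print(float(loss))
```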
World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190 ms latency, over 4x faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: this https URL
https://arxiv.org/abs/2603.16666
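The training/inference asymmetry described above can be summarized in a few lines: a shared backbone is co-trained with an auxiliary video(-feature) prediction head that is simply never called at test time. This is a hedged sketch of that idea, not the paper's architecture; all module names and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class FastWAMSketch(nn.Module):
    """Shared backbone with an action head and an auxiliary video-prediction head.

    The video head only shapes the representation during training; at test time
    we call act() and never generate future observations.
    """
    def __init__(self, obs_dim=512, act_dim=7):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 512), nn.SiLU(), nn.Linear(512, 512))
        self.action_head = nn.Linear(512, act_dim)
        self.video_head = nn.Linear(512, obs_dim)   # predicts next-frame features

    def training_loss(self, obs_feat, action_gt, next_obs_feat, lam=0.5):
        h = self.backbone(obs_feat)
        action_loss = ((self.action_head(h) - action_gt) ** 2).mean()
        video_loss = ((self.video_head(h) - next_obs_feat) ** 2).mean()
        return action_loss + lam * video_loss

    @torch.no_grad()
    def act(self, obs_feat):                        # no future imagination at test time
        return self.action_head(self.backbone(obs_feat))

model = FastWAMSketch()
loss = model.training_loss(torch.randn(4, 512), torch.randn(4, 7), torch.randn(4, 512))
print(float(loss), model.act(torch.randn(1, 512)).shape)
```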
Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation; however, this potential has not been fully explored in the literature. To bridge this gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. We release code and models at this https URL.
https://arxiv.org/abs/2603.10448
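A toy sketch of a dual flow-matching objective with decoupled timesteps, in which the action branch is conditioned on intermediate features from the video branch's denoising pass rather than on rendered frames; the tiny networks, the linear interpolation path, and the detach choice are illustrative assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field for flow matching; returns the velocity and a hidden state."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim + 1, 256), nn.SiLU())
        self.head = nn.Linear(256, out_dim)

    def forward(self, x, t):
        h = self.body(torch.cat([x, t], dim=-1))
        return self.head(h), h

def dual_flow_matching_loss(video_net, action_net, video, action):
    """Decoupled timesteps t_v and t_a for the video and action branches; the action
    branch consumes the video branch's intermediate denoising features as condition."""
    t_v, t_a = torch.rand(video.shape[0], 1), torch.rand(action.shape[0], 1)
    n_v, n_a = torch.randn_like(video), torch.randn_like(action)
    x_v = (1 - t_v) * n_v + t_v * video                     # linear probability path
    x_a = (1 - t_a) * n_a + t_a * action
    v_v, hidden = video_net(x_v, t_v)
    v_a, _ = action_net(torch.cat([x_a, hidden.detach()], dim=-1), t_a)
    loss_v = ((v_v - (video - n_v)) ** 2).mean()            # target velocity = x1 - x0
    loss_a = ((v_a - (action - n_a)) ** 2).mean()
    return loss_v + loss_a

video_net = VelocityNet(in_dim=512, out_dim=512)
action_net = VelocityNet(in_dim=7 + 256, out_dim=7)
print(float(dual_flow_matching_loss(video_net, action_net, torch.randn(4, 512), torch.randn(4, 7))))
```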
Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real-world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
https://arxiv.org/abs/2603.08546
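The efficiency claim rests on one-step, consistency-style prediction: each simulated step is a single forward pass through a latent dynamics model, so long-horizon rollouts stay cheap. A minimal sketch with a toy network standing in for the learned consistency model (the real decoder, latent shapes, and training are omitted):

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Toy one-step latent transition model conditioned on an action."""
    def __init__(self, z_dim=128, a_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + a_dim, 256), nn.SiLU(), nn.Linear(256, z_dim))

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))   # next latent in a single forward pass

@torch.no_grad()
def rollout(dynamics, z0, actions):
    """Autoregressive rollout: one network call per step keeps long horizons cheap
    (e.g. 10 minutes at 15 FPS would be roughly 9000 such calls)."""
    z, traj = z0, []
    for a in actions:             # actions: iterable of (B, a_dim) tensors
        z = dynamics(z, a)
        traj.append(z)
    return torch.stack(traj)      # (T, B, z_dim)

dyn = LatentDynamics()
traj = rollout(dyn, torch.randn(2, 128), [torch.randn(2, 7) for _ in range(50)])
print(traj.shape)  # torch.Size([50, 2, 128])
```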
We present DreamToNav, a novel autonomous robot framework that uses generative video models to enable intuitive, human-in-the-loop control. Instead of relying on rigid waypoint navigation, users provide natural language prompts (e.g. "Follow the person carefully"), which the system translates into executable motion. Our pipeline first employs Qwen 2.5-VL-7B-Instruct to refine vague user instructions into precise visual descriptions. These descriptions condition NVIDIA Cosmos 2.5, a state-of-the-art video foundation model, to synthesize a physically consistent video sequence of the robot performing the task. From this synthetic video, we extract a valid kinematic path using visual pose estimation, robot detection and trajectory recovery. By treating video generation as a planning engine, DreamToNav allows robots to visually "dream" complex behaviors before executing them, providing a unified framework for obstacle avoidance and goal-directed navigation without task-specific engineering. We evaluate the approach on both a wheeled mobile robot and a quadruped robot in indoor navigation tasks. DreamToNav achieves a success rate of 76.7%, with final goal errors typically within 0.05-0.10 m and trajectory tracking errors below 0.15 m. These results demonstrate that trajectories extracted from generative video predictions can be reliably executed on physical robots across different locomotion platforms.
https://arxiv.org/abs/2603.06190
Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Codes & Models will be released.
https://arxiv.org/abs/2602.10717
Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, running two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF's strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.
https://arxiv.org/abs/2602.00148
We present Akasha 2, a state-of-the-art multimodal architecture that integrates Hamiltonian State Space Duality (H-SSD) with Visual-Language Joint Embedding Predictive Architecture (VL-JEPA). The system leverages the Mamba-3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE-HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra-low latency (<50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics-inspired inductive biases into neural architectures yields significant improvements: state-of-the-art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3-18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.
https://arxiv.org/abs/2601.06212
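Symplectic integration is the mechanism cited for enforcing conservation laws in the latent dynamics. As background, here is a standard leapfrog (Stormer-Verlet) integrator on a toy harmonic oscillator, illustrating the long-horizon energy stability the abstract appeals to; this is generic numerical machinery, not the SMoE-HE implementation:

```python
import torch

def leapfrog(q, p, grad_V, dt, steps):
    """Leapfrog integration of a separable Hamiltonian H = p^2/2 + V(q).

    Symplectic integrators like this keep the (slightly modified) energy bounded
    over very long horizons, unlike naive Euler stepping.
    """
    for _ in range(steps):
        p = p - 0.5 * dt * grad_V(q)   # half kick
        q = q + dt * p                 # drift (unit mass)
        p = p - 0.5 * dt * grad_V(q)   # half kick
    return q, p

# Toy harmonic oscillator: V(q) = 0.5 * q^2, so grad_V(q) = q.
q, p = torch.tensor([1.0]), torch.tensor([0.0])
q_T, p_T = leapfrog(q, p, lambda x: x, dt=0.05, steps=2000)
energy_0 = 0.5 * (p ** 2 + q ** 2)
energy_T = 0.5 * (p_T ** 2 + q_T ** 2)
print(float(energy_0), float(energy_T))   # energies stay close despite 2000 steps
```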
Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models' ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support training at scale, we introduce a scalable data synthesis pipeline capable of transforming arbitrary video corpora into MVP training samples, and further employ Group Relative Policy Optimization (GRPO) with a fine-grained reward function to enhance the model's understanding of video context and temporal properties. Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
https://arxiv.org/abs/2601.03781
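One plausible shape for an MVP training sample, matching the abstract's description (mask a contiguous segment, mix the true segment with hard distractors, supervise the choice); the identifiers and sampling scheme below are assumptions about the data synthesis pipeline, not its actual code:

```python
import random

def make_mvp_sample(video_clips, distractor_pool, seg_len=4, n_distractors=3, seed=None):
    """Build one Masked Video Prediction sample.

    video_clips: clip identifiers for one video, in temporal order.
    distractor_pool: clips (e.g. from other videos) used as hard negatives.
    Returns the context with a masked span, the candidate segments, and the correct index.
    """
    rng = random.Random(seed)
    start = rng.randrange(0, len(video_clips) - seg_len + 1)
    target = video_clips[start:start + seg_len]
    context = video_clips[:start] + ["<MASK>"] * seg_len + video_clips[start + seg_len:]
    candidates = [target] + [rng.sample(distractor_pool, seg_len) for _ in range(n_distractors)]
    rng.shuffle(candidates)
    return {"context": context, "candidates": candidates, "answer": candidates.index(target)}

clips = [f"clip_{i}" for i in range(12)]
pool = [f"neg_{i}" for i in range(50)]
sample = make_mvp_sample(clips, pool, seed=0)
print(sample["answer"], sample["context"])
```

In a GRPO setup, a fine-grained reward over such samples would score how well the model's chosen (or reconstructed) segment matches the ground-truth span.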
Prevalent Vision-Language-Action (VLA) models are typically built upon Multimodal Large Language Models (MLLMs) and demonstrate exceptional proficiency in semantic understanding, but they inherently lack the capability to deduce physical world dynamics. Consequently, recent approaches have shifted toward World Models, typically formulated via video prediction; however, these methods often suffer from a lack of semantic grounding and exhibit brittleness when handling prediction errors. To synergize semantic understanding with dynamic predictive capabilities, we present InternVLA-A1. This model employs a unified Mixture-of-Transformers architecture, coordinating three experts for scene understanding, visual foresight generation, and action execution. These components interact seamlessly through a unified masked self-attention mechanism. Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. We pre-train these models on hybrid synthetic-real datasets spanning InternData-A1 and Agibot-World, covering over 533M frames. This hybrid training strategy effectively harnesses the diversity of synthetic simulation data while minimizing the sim-to-real gap. We evaluated InternVLA-A1 across 12 real-world robotic tasks and simulation benchmarks. It significantly outperforms leading models like pi0 and GR00T N1.5, achieving a 14.5% improvement in daily tasks and a 40%-73.3% boost in dynamic settings, such as conveyor belt sorting.
https://arxiv.org/abs/2601.02456
World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
https://arxiv.org/abs/2512.23421
Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data, and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks that assess the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models publicly available at: this https URL.
https://arxiv.org/abs/2512.22688
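A toy rendering of autoregressive flow matching for continuous tokens such as 2-D point-track locations: condition a velocity field on the history and regress the flow-matching target for the next token. The GRU history encoder, the linear interpolation path, and the single supervised step per batch are illustrative choices, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ARFlowStep(nn.Module):
    """Toy autoregressive flow-matching step: predicts the velocity for the next
    continuous token (e.g. a 2-D point-track location) given a summary of the history."""
    def __init__(self, token_dim=2, ctx_dim=64):
        super().__init__()
        self.history = nn.GRU(token_dim, ctx_dim, batch_first=True)
        self.velocity = nn.Sequential(nn.Linear(token_dim + ctx_dim + 1, 128), nn.SiLU(),
                                      nn.Linear(128, token_dim))

    def forward(self, past_tokens, x_t, t):
        _, h = self.history(past_tokens)                        # (1, B, ctx_dim)
        return self.velocity(torch.cat([x_t, h[-1], t], dim=-1))

def arfm_loss(model, tracks):
    """tracks: (B, T, 2). Train one step to transport noise to the next location."""
    B, T, _ = tracks.shape
    step = torch.randint(1, T, (1,)).item()                     # which future token to supervise
    x1 = tracks[:, step]                                        # target location
    t = torch.rand(B, 1)
    x0 = torch.randn_like(x1)
    x_t = (1 - t) * x0 + t * x1
    v = model(tracks[:, :step], x_t, t)
    return ((v - (x1 - x0)) ** 2).mean()

model = ARFlowStep()
print(float(arfm_loss(model, torch.randn(8, 16, 2))))
```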
Video prediction is plagued by a fundamental trilemma: achieving high-resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR's single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to $O(S + T)$ and memory complexity to $O(\max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond, operating directly on dense feature maps with a patch-free design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state-of-the-art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18%, paving the way for safer and more anticipatory embodied agents.
https://arxiv.org/abs/2512.21710
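To illustrate the factorization behind EVA, the sketch below alternates attention along the spatial and then the temporal axis instead of attending over flattened space-time tokens. It uses standard per-axis attention and therefore does not reproduce the claimed O(S + T) cost; it only shows the axis-alternating structure, with module names as assumptions:

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    """Axis-factorized attention: attend within each frame (spatial axis), then across
    time at each spatial location (temporal axis), instead of over S*T flattened tokens."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                 # x: (B, T, S, D)
        B, T, S, D = x.shape
        xs = x.reshape(B * T, S, D)                       # attention within each frame
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(B, T, S, D)
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)   # attention across time per location
        xt, _ = self.temporal(xt, xt, xt)
        return xt.reshape(B, S, T, D).permute(0, 2, 1, 3)

attn = FactorizedSpaceTimeAttention()
print(attn(torch.randn(2, 8, 256, 64)).shape)   # torch.Size([2, 8, 256, 64])
```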
We present STORM (Search-Guided Generative World Models), a novel framework for spatio-temporal reasoning in robotic manipulation that unifies diffusion-based action generation, conditional video prediction, and search-based planning. Unlike prior Vision-Language-Action (VLA) models that rely on abstract latent dynamics or delegate reasoning to language components, STORM grounds planning in explicit visual rollouts, enabling interpretable and foresight-driven decision-making. A diffusion-based VLA policy proposes diverse candidate actions, a generative video world model simulates their visual and reward outcomes, and Monte Carlo Tree Search (MCTS) selectively refines plans through lookahead evaluation. Experiments on the SimplerEnv manipulation benchmark demonstrate that STORM achieves a new state-of-the-art average success rate of 51.0 percent, outperforming strong baselines such as CogACT. Reward-augmented video prediction substantially improves spatio-temporal fidelity and task relevance, reducing Frechet Video Distance by over 75 percent. Moreover, STORM exhibits robust re-planning and failure recovery behavior, highlighting the advantages of search-guided generative world models for long-horizon robotic manipulation.
https://arxiv.org/abs/2512.18477
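A simplified stand-in for the search-guided loop described above: sample candidate action sequences from a policy, roll each through a world model, score the imagined outcomes with a reward model, and execute the first action of the best plan. Full MCTS adds tree expansion and value backup, which this greedy sketch omits; all three callables are hypothetical stubs:

```python
import torch

@torch.no_grad()
def lookahead_select(policy_sample, world_model, reward_model, obs, n_candidates=8, horizon=4):
    """Greedy lookahead over policy proposals: keep the candidate plan whose imagined
    rollout earns the highest predicted return, then execute only its first action."""
    best_action, best_return = None, -float("inf")
    for _ in range(n_candidates):
        actions = policy_sample(obs, horizon)          # (horizon, act_dim) candidate plan
        state, ret = obs, 0.0
        for a in actions:
            state = world_model(state, a)              # imagined next observation/latent
            ret += float(reward_model(state))
        if ret > best_return:
            best_return, best_action = ret, actions[0]
    return best_action

# Toy stand-ins for the diffusion policy, video world model, and reward model.
policy_sample = lambda obs, h: torch.randn(h, 7)
world_model = lambda s, a: s + 0.01 * torch.randn_like(s)
reward_model = lambda s: -s.norm()
print(lookahead_select(policy_sample, world_model, reward_model, torch.randn(64)).shape)
```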
Robotic arm manipulation in data-scarce settings is a highly challenging task due to the complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring the temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for the embodiment-specific closed-loop control, typically suffering from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error correction capabilities across previously unseen robotic platforms.
https://arxiv.org/abs/2512.17661
Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames, balancing responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.
https://arxiv.org/abs/2512.08931
In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models observe continuously new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO's effectiveness.
https://arxiv.org/abs/2511.18255
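The core trick, adapting only the sampling noise while the diffusion model stays frozen, can be sketched as a small optimization loop. The reconstruction objective on recently observed stream frames is an assumption about the adaptation signal, and denoise_fn is a stand-in for the frozen sampler, not the paper's API:

```python
import torch

def optimize_sampling_noise(denoise_fn, recent_frames, noise_shape, steps=20, lr=0.05):
    """Inference-time diffusion noise optimization with frozen model weights:
    only the sampling noise is updated, here against a reconstruction loss on
    frames recently observed from the video stream."""
    noise = torch.randn(noise_shape, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        pred = denoise_fn(noise)                     # frozen model maps noise -> frames
        loss = ((pred - recent_frames) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()

# Toy frozen "denoiser" and a dummy stream context.
frozen = torch.nn.Linear(32, 32)
for p in frozen.parameters():
    p.requires_grad_(False)
context = torch.randn(4, 32)
tuned_noise = optimize_sampling_noise(frozen, context, noise_shape=(4, 32))
print(tuned_noise.shape)  # torch.Size([4, 32])
```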