Abstract
Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently been applied to world modeling. However, unlike typical generation tasks that encourage sample diversity, world models involve distinct sources of uncertainty and require samples consistent with the ground-truth trajectory, a requirement that, as we empirically observe, diffusion models often fail to meet. We argue that a key bottleneck in learning consistent diffusion-based world models is their suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream that processes conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.
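To make the decoupling concrete, below is a minimal, hypothetical PyTorch sketch of the two-stream idea the abstract describes: a deterministic predictor encodes the conditioning inputs on its own, and its frozen features guide a separate denoising network. The corruption process, the concatenation-based conditioning, and all module and variable names are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code): a minimal two-stream layout,
# assuming a pretrained, frozen predictor whose features condition the
# denoiser via simple feature concatenation.
import torch
import torch.nn as nn

class DeterministicPredictor(nn.Module):
    """Deterministic stream: encodes conditioning inputs into features
    and a point forecast (assumed pretrained with a regression loss)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, dim)  # point prediction of the target state

    def forward(self, cond: torch.Tensor):
        feats = self.encoder(cond)  # representations that guide generation
        return feats, self.head(feats)

class Denoiser(nn.Module):
    """Denoising stream: predicts noise from the noisy target, the timestep,
    and the predictor's features, so it never reads raw conditions itself."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.time_embed = nn.Linear(1, dim)
        self.net = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x_t, t, cond_feats):
        h = torch.cat([x_t, self.time_embed(t), cond_feats], dim=-1)
        return self.net(h)

# One hypothetical training step for the denoiser with a frozen predictor.
predictor, denoiser = DeterministicPredictor(), Denoiser()
predictor.requires_grad_(False)   # pretrained and kept frozen in this sketch
cond = torch.randn(8, 64)         # conditioning input (e.g. past frames)
x0 = torch.randn(8, 64)           # ground-truth target
t = torch.rand(8, 1)              # diffusion timestep in [0, 1]
noise = torch.randn_like(x0)
x_t = (1 - t) * x0 + t * noise    # simple flow-style corruption, for illustration

feats, _ = predictor(cond)        # condition understanding happens here
loss = ((denoiser(x_t, t, feats) - noise) ** 2).mean()
loss.backward()
```

The design choice this sketch highlights is the one the abstract argues for: the predictor owns condition understanding and can be trained (or pretrained) on its own objective, while the denoiser only consumes its representations, rather than sharing one architecture and co-training both roles.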
URL
https://arxiv.org/abs/2505.16474