Consistent World Models via Foresight Diffusion

2025-05-22 10:01:59
Yu Zhang, Xingzhuo Guo, Haoran Xu, Mingsheng Long

Abstract

Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require samples consistent with the ground-truth trajectory, a requirement that, as we empirically observe, diffusion models often fail to meet. We argue that a key bottleneck in learning consistent diffusion-based world models lies in their suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream that processes conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.
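
To make the decoupling concrete, below is a minimal sketch of the two-stream idea the abstract describes, assuming a PyTorch setup. All module names (DeterministicPredictor, DenoisingStream), the concatenation-based fusion, and the toy noising schedule are illustrative assumptions for exposition, not the paper's actual architecture or training recipe.

    import torch
    import torch.nn as nn

    class DeterministicPredictor(nn.Module):
        """Predictive stream (hypothetical): encodes conditioning inputs into
        representations without ever seeing diffusion noise. Stands in for
        ForeDiff's pretrained deterministic predictor."""
        def __init__(self, dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, cond):
            return self.encoder(cond)  # condition representations

    class DenoisingStream(nn.Module):
        """Denoising stream (hypothetical): denoises the noisy target, guided
        by the predictor's representations via simple concatenation (the
        paper's actual fusion mechanism may differ)."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * dim + 1, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, x_t, t, cond_repr):
            t_emb = t.expand(x_t.shape[0], 1)  # crude scalar timestep embedding
            return self.net(torch.cat([x_t, cond_repr, t_emb], dim=-1))

    # One training step with a noise-prediction objective. The predictor is
    # frozen to mimic using a *pretrained* predictive stream.
    predictor, denoiser = DeterministicPredictor(), DenoisingStream()
    predictor.requires_grad_(False)

    cond = torch.randn(8, 64)       # conditioning input (e.g., flattened past frames)
    x0 = torch.randn(8, 64)         # ground-truth future target
    t = torch.rand(1)               # diffusion time in [0, 1]
    noise = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * noise  # toy linear interpolation between data and noise

    cond_repr = predictor(cond)          # condition understanding: no noise involved
    pred = denoiser(x_t, t, cond_repr)   # target denoising, guided by the predictor
    loss = (pred - noise).pow(2).mean()  # predict the injected noise
    loss.backward()                      # gradients flow only into the denoiser

The point of the split is that the condition representations never depend on the noise level t, which is the decoupling the abstract argues improves sample consistency.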

URL

https://arxiv.org/abs/2505.16474

PDF

https://arxiv.org/pdf/2505.16474.pdf

