Abstract
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action-following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically reasonable world modeling and its applications.
URL
https://arxiv.org/abs/2503.18945