Abstract
World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
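The abstract's core mechanism — a memory bank of units holding frame features plus states (pose, timestamp), queried by a state-based attention step — can be sketched as below. All function names, the distance-based scoring rule, and the weights are illustrative assumptions for exposition, not the paper's actual implementation (which uses learned attention inside a generative model).

```python
import math

def make_unit(frame_feat, pose, timestamp):
    """A memory unit: frame features plus state (pose, timestamp)."""
    return {"frame": list(frame_feat), "pose": list(pose), "time": float(timestamp)}

def state_score(unit, query_pose, query_time, w_pose=1.0, w_time=0.1):
    """Illustrative relevance score: closer pose and timestamp -> higher score."""
    d_pose = math.dist(unit["pose"], query_pose)
    d_time = abs(unit["time"] - query_time)
    return -(w_pose * d_pose + w_time * d_time)

def memory_attention(query_pose, query_time, units, top_k=2):
    """Pick the top-k most relevant units by state, then blend their
    frame features with softmax weights over the state scores."""
    scored = sorted(units,
                    key=lambda u: state_score(u, query_pose, query_time),
                    reverse=True)[:top_k]
    scores = [state_score(u, query_pose, query_time) for u in scored]
    m = max(scores)                      # subtract max for numerical stability
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    ws = [w / z for w in ws]
    dim = len(scored[0]["frame"])
    # Convex combination of the selected memory frames
    return [sum(w * u["frame"][j] for w, u in zip(ws, scored)) for j in range(dim)]

# Usage: a small bank; the query state matches the first and third units best.
bank = [make_unit([1.0, 0.0], [0.0, 0.0, 0.0], 0),
        make_unit([0.0, 1.0], [5.0, 0.0, 0.0], 10),
        make_unit([0.5, 0.5], [0.1, 0.0, 0.0], 1)]
out = memory_attention([0.0, 0.0, 0.0], 0.0, bank)
```

Because retrieval keys on states rather than recency, a previously observed scene can be recovered even after a long temporal gap, which is the consistency property the abstract highlights.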
URL
https://arxiv.org/abs/2504.12369