Abstract
This paper investigates training better visual world models for robot manipulation, i.e., models that predict future visual observations conditioned on past frames and robot actions. Specifically, we consider world models that operate on RGB-D frames (RGB-D world models). As opposed to canonical approaches that handle dynamics prediction mostly implicitly and reconcile it with visual rendering in a single model, we introduce FlowDreamer, which adopts 3D scene flow as an explicit motion representation. FlowDreamer first predicts 3D scene flow from the past frame and action conditions with a U-Net, and a diffusion model then predicts the future frame utilizing the scene flow. FlowDreamer is trained end-to-end despite its modularized nature. We conduct experiments on 4 different benchmarks, covering both video prediction and visual planning tasks. The results demonstrate that FlowDreamer outperforms baseline RGB-D world models by 7% in semantic similarity, 11% in pixel quality, and 6% in success rate across various robot manipulation domains.
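The abstract describes a two-stage pipeline: a U-Net maps the past RGB-D frame and robot action to 3D scene flow, and a diffusion model conditions on that flow to predict the future frame, with both stages trained end-to-end. Below is a minimal PyTorch sketch of this structure, assuming a PyTorch implementation; all module names, layer choices, and dimensions (FlowUNet, FrameDiffusion, action_dim, hidden sizes) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
import torch
import torch.nn as nn


class FlowUNet(nn.Module):
    """Stage 1 (hypothetical): predict per-pixel 3D scene flow from the past
    RGB-D frame and the robot action."""

    def __init__(self, action_dim: int = 7, hidden: int = 64):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, hidden)
        self.encoder = nn.Conv2d(4, hidden, kernel_size=3, padding=1)   # RGB-D in
        self.decoder = nn.Conv2d(hidden, 3, kernel_size=3, padding=1)   # (dx, dy, dz) out

    def forward(self, rgbd: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        a = self.action_proj(action)[..., None, None]   # broadcast action over H x W
        feat = torch.relu(self.encoder(rgbd)) + a
        return self.decoder(feat)                       # B x 3 x H x W scene flow


class FrameDiffusion(nn.Module):
    """Stage 2 (hypothetical): denoise the future RGB-D frame conditioned on
    the past frame and the predicted scene flow (one denoising step shown)."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4 + 4 + 3, hidden, kernel_size=3, padding=1),  # noisy frame + past frame + flow
            nn.ReLU(),
            nn.Conv2d(hidden, 4, kernel_size=3, padding=1),
        )

    def forward(self, noisy_next: torch.Tensor, rgbd: torch.Tensor,
                flow: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([noisy_next, rgbd, flow], dim=1))


# End-to-end forward pass: both stages are differentiable, so a combined loss
# on the flow and the predicted frame can be backpropagated through the whole model.
flow_net, frame_net = FlowUNet(), FrameDiffusion()
rgbd = torch.randn(1, 4, 64, 64)     # past RGB-D frame
action = torch.randn(1, 7)           # robot action
noisy = torch.randn(1, 4, 64, 64)    # noised future frame (training input)

flow = flow_net(rgbd, action)
pred = frame_net(noisy, rgbd, flow)
print(flow.shape, pred.shape)        # torch.Size([1, 3, 64, 64]) torch.Size([1, 4, 64, 64])
```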
URL
https://arxiv.org/abs/2505.10075