Abstract
Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement-learning-based post-training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: this https URL
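To make the idea of MCTS-based inference-time search for a diffusion model concrete, the following is a minimal, self-contained sketch of how such a search could be organized: each tree node holds a partially denoised sample, branching corresponds to drawing several stochastic reverse-diffusion steps, rollouts finish denoising and are scored by a downstream objective, and UCT guides which branch to refine. All names here (denoise_step, task_reward, Node, mcts_denoise) and all hyperparameters are illustrative assumptions, not the paper's actual implementation or reward.

```python
# Hypothetical sketch: MCTS-style inference-time search over diffusion
# denoising steps. The denoiser, reward, and hyperparameters are
# placeholders, not the method released with the paper.
import math
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t):
    """Placeholder stochastic reverse-diffusion update toward x_0."""
    return x * 0.9 + 0.1 * rng.normal(size=x.shape)

def task_reward(x):
    """Placeholder downstream objective (e.g. clutter or feasibility score)."""
    return -float(np.abs(x).mean())

class Node:
    def __init__(self, x, t, parent=None):
        self.x, self.t, self.parent = x, t, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(child, parent, c=1.4):
    # Upper-confidence bound used to trade off exploring new branches
    # against refining branches that already scored well.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent.visits) / child.visits)

def mcts_denoise(x_T, num_steps=10, branches=4, simulations=64):
    root = Node(x_T, num_steps)
    for _ in range(simulations):
        node = root
        # Selection: descend by UCT until reaching a leaf or a fully
        # denoised node (t == 0).
        while node.children and node.t > 0:
            node = max(node.children, key=lambda ch: uct(ch, node))
        # Expansion: branch by sampling several stochastic denoising steps.
        if node.t > 0:
            for _ in range(branches):
                node.children.append(
                    Node(denoise_step(node.x, node.t), node.t - 1, node))
            node = node.children[0]
        # Rollout: finish denoising greedily and score the final sample.
        x, t = node.x, node.t
        while t > 0:
            x, t = denoise_step(x, t), t - 1
        r = task_reward(x)
        # Backpropagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Commit to the most-visited first denoising branch.
    best = max(root.children, key=lambda ch: ch.visits)
    return best.x

sample = mcts_denoise(rng.normal(size=(8,)))
```

In a scene-synthesis setting, the sample x would encode object identities and SE(3) poses rather than a flat vector, and the reward would come from the task-specific objective (with projection and simulation enforcing physical feasibility), but the search skeleton stays the same.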
URL
https://arxiv.org/abs/2505.04831