Paper Reading AI Learner

Steerable Scene Generation with Post Training and Inference-Time Search

2025-05-07 22:07:42
Nicholas Pfaff, Hongkai Dai, Sergey Zakharov, Shun Iwase, Russ Tedrake

Abstract

Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: this https URL
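The abstract's MCTS-based inference-time search for diffusion models is not detailed here, but the general idea — tree search over partial denoising trajectories, scored by a downstream objective — can be illustrated with a toy sketch. Everything below is hypothetical: `denoise_step` stands in for one reverse-diffusion step of the paper's trained scene model, `reward` for a task objective (e.g. a clutter or feasibility score), and the scene is reduced to a list of floats rather than object identities and SE(3) poses.

```python
import math
import random

def denoise_step(scene, rng):
    """Hypothetical single reverse-diffusion step: propose a refined scene.
    A real system would call the trained diffusion model here."""
    return [x + rng.gauss(0, 0.1) for x in scene]

def reward(scene):
    """Hypothetical downstream objective (higher is better).
    This toy score peaks when the scene values reach the origin."""
    return -sum(x * x for x in scene)

class Node:
    """One node per partially denoised scene in the search tree."""
    def __init__(self, scene, step, parent=None):
        self.scene, self.step, self.parent = scene, step, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(child, parent, c=1.4):
    """Upper confidence bound for tree search (UCT)."""
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def mcts_search(init_scene, steps=5, iters=200, branch=3, seed=0):
    """MCTS over denoising trajectories: each tree edge is one sampled
    denoising step; leaves are rolled out to a final scene and scored."""
    rng = random.Random(seed)
    root = Node(init_scene, 0)
    for _ in range(iters):
        # Selection: descend by UCB while nodes are fully expanded.
        node = root
        while node.step < steps and len(node.children) == branch:
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # Expansion: sample one more denoising continuation.
        if node.step < steps:
            child = Node(denoise_step(node.scene, rng), node.step + 1, node)
            node.children.append(child)
            node = child
        # Rollout: denoise to completion, then score the final scene.
        scene = node.scene
        for _ in range(steps - node.step):
            scene = denoise_step(scene, rng)
        r = reward(scene)
        # Backpropagation: accumulate the score up to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Return the scene of the most-visited first-step child.
    best = max(root.children, key=lambda ch: ch.visits)
    return best.scene
```

In this sketch the search biases sampling toward denoising branches whose completions score well under the objective, without retraining the model — the same steering role the abstract assigns to inference-time search alongside post training and conditional generation. The paper's actual formulation, branching scheme, and feasibility projection are not reproduced here.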

URL

https://arxiv.org/abs/2505.04831

PDF

https://arxiv.org/pdf/2505.04831.pdf

