Paper Reading AI Learner

WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

2025-05-23 17:59:24
Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, Jiajun Wu

Abstract

WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elastic, and rigid bodies -- all using a single image input. Code will be made public. Project website: this https URL
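The simulate-then-refine loop described in the abstract can be summarized in pseudocode. The sketch below is a minimal illustration only; the class and method names (physics_solver.simulate, video_generator.generate, scene_updater.update) are hypothetical placeholders, since the abstract does not specify WonderPlay's actual interfaces.

```python
# Minimal sketch of the hybrid generative simulator loop from the abstract.
# All component names are hypothetical placeholders, not WonderPlay's API.

def wonderplay_step(scene_3d, action, physics_solver, video_generator, scene_updater):
    """One action-conditioned update of the dynamic 3D scene."""
    # 1. The physics solver simulates coarse 3D dynamics for the user action.
    coarse_dynamics = physics_solver.simulate(scene_3d, action)

    # 2. The coarse dynamics condition a video generator, which synthesizes
    #    a video with finer, more realistic motion.
    video = video_generator.generate(scene_3d, condition=coarse_dynamics)

    # 3. The generated video is used to update the simulated dynamic 3D scene,
    #    closing the loop between the physics solver and the video generator.
    return scene_updater.update(scene_3d, video)
```

In this reading, user control and physical accuracy enter through the solver, while the diffusion-based generator supplies visual realism that is folded back into the 3D scene state.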

Abstract (translated)

WonderPlay is a novel framework that combines physics simulation with video generation to produce action-conditioned dynamic 3D scenes from a single image. While prior work is limited to rigid-body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, on top of which a video generator produces a video with finer, more realistic motion. The generated video is then used to update the dynamic 3D scene, forming a closed loop in which the physics solver and the video generator interact. This approach combines intuitive user control with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results show that WonderPlay allows users to interact with diverse scenes containing cloth, sand, snow, liquid, smoke, elastic, and rigid bodies, all from a single image input. Code will be made public. Project website: this https URL

URL

https://arxiv.org/abs/2505.18151

PDF

https://arxiv.org/pdf/2505.18151.pdf

