Paper Reading AI Learner

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

2024-09-27 17:59:57
Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, Shenlong Wang

Abstract

We present PhysGen, a novel image-to-video generation method that converts a single image and an input condition (e.g., force and torque applied to an object in the image) to produce a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that effectively captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that utilizes rigid-body physics and inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are even precisely controllable, showcasing superior results over existing data-driven image-to-video generation works through quantitative comparison and comprehensive user study. PhysGen's resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics. Project page: this https URL
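The core of component (ii) is advancing an object's state under a user-supplied force and torque with rigid-body dynamics. As a minimal sketch (not PhysGen's actual implementation — all class names, parameter values, and the integrator choice here are illustrative assumptions), a 2D body in image space can be stepped with semi-implicit Euler integration:

```python
# Minimal sketch of image-space rigid-body dynamics: a 2D body advanced
# by an applied force and torque. Names and values are illustrative
# assumptions, not PhysGen's code.
from dataclasses import dataclass

@dataclass
class RigidBody2D:
    mass: float          # inferred physical parameter
    inertia: float       # rotational inertia about the centroid
    x: float = 0.0       # image-space position
    y: float = 0.0
    theta: float = 0.0   # orientation (radians)
    vx: float = 0.0      # linear velocity
    vy: float = 0.0
    omega: float = 0.0   # angular velocity

    def step(self, fx: float, fy: float, torque: float, dt: float) -> None:
        # Semi-implicit Euler: update velocities from force/torque first,
        # then advance positions with the new velocities.
        self.vx += fx / self.mass * dt
        self.vy += fy / self.mass * dt
        self.omega += torque / self.inertia * dt
        self.x += self.vx * dt
        self.y += self.vy * dt
        self.theta += self.omega * dt

# Apply a brief horizontal push to a unit-mass body and record 30 frames.
body = RigidBody2D(mass=1.0, inertia=1.0)
trajectory = []
for frame in range(30):
    fx = 10.0 if frame < 5 else 0.0  # impulse-like push, then free motion
    body.step(fx, 0.0, torque=0.0, dt=1.0 / 30.0)
    trajectory.append((body.x, body.y))
```

In the full system, the simulated per-frame poses would then drive the rendering and diffusion-based refinement stage rather than being visualized directly.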

URL

https://arxiv.org/abs/2409.18964

PDF

https://arxiv.org/pdf/2409.18964.pdf

