Abstract
Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model designs. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging the camera and object movements commonly observed in daily life. Since real-world 4D data is scarce in the community, we first propose a data curation pipeline that obtains camera poses and object motion strength from videos. Based on this pipeline, we introduce CamVid-30K, a large-scale real-world 4D scene dataset. By leveraging all the 3D and 4D data, we develop GenXD, a framework that can produce any 3D or 4D scene. We propose multiview-temporal modules that disentangle camera and object movements, enabling seamless learning from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variable number of conditioning views. GenXD can generate videos that follow a given camera trajectory as well as consistent 3D views that can be lifted into 3D representations. Extensive evaluations across various real-world and synthetic datasets demonstrate GenXD's effectiveness and versatility compared to previous methods for 3D and 4D generation.
URL
https://arxiv.org/abs/2411.02319