Abstract
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a diffusion-based Text-to-Video (T2V) model. The dynamic video output generated from the provided text can be viewed from any camera location and angle and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data, and the T2V model is trained only on text-image pairs and unlabeled videos. We demonstrate the effectiveness of our approach with comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes from a text description.
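The core loop the abstract describes can be made concrete with a short sketch: render candidate videos from a 4D (space + time) radiance field and update the field with a score-distillation-style gradient from a T2V diffusion model. The code below is a minimal illustration, not the authors' implementation; Dynamic4DField, render_video, and t2v_guidance_grad are hypothetical stand-ins (a real system would use a full volumetric renderer with sampled camera poses and a pretrained T2V diffusion model in place of the random-gradient stub).

import torch
import torch.nn as nn

class Dynamic4DField(nn.Module):
    # Toy stand-in for a 4D dynamic NeRF: maps (x, y, z, t) to RGB + density.
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 density
        )

    def forward(self, xyzt):
        out = self.mlp(xyzt)
        rgb = torch.sigmoid(out[..., :3])               # colors in [0, 1]
        density = nn.functional.softplus(out[..., 3:])  # non-negative density
        return torch.cat([rgb, density], dim=-1)

def render_video(field, n_frames=8, res=32):
    # Hypothetical renderer: evaluates the field at random points per frame.
    # A real implementation would ray-march from a sampled camera pose.
    frames = []
    for t in torch.linspace(0.0, 1.0, n_frames):
        coords = torch.rand(res * res, 3)  # random spatial samples
        xyzt = torch.cat([coords, torch.full((res * res, 1), t.item())], dim=-1)
        frames.append(field(xyzt)[..., :3].reshape(res, res, 3))
    return torch.stack(frames)             # (T, H, W, 3)

def t2v_guidance_grad(video, prompt):
    # Stub for the T2V guidance. In MAV3D this would be a score-distillation
    # gradient from a pretrained Text-to-Video diffusion model conditioned on
    # `prompt`; here it is a placeholder of the right shape.
    return torch.randn_like(video)

field = Dynamic4DField()
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
prompt = "a corgi playing with a ball"  # illustrative prompt

for step in range(100):
    video = render_video(field)
    grad = t2v_guidance_grad(video, prompt)
    # SDS-style update: d(loss)/d(video) == grad, which backpropagates
    # through the renderer into the field's parameters.
    loss = (video * grad.detach()).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

The key design point is the SDS-style update at the end: the guidance gradient is detached and dotted with the rendered video, so backpropagation delivers exactly that gradient to the field's parameters through the renderer, with no loss ever evaluated on the diffusion model itself.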
URL
https://arxiv.org/abs/2301.11280