Abstract
Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate: they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box using rigid transformation along a trajectory parameterized by a spline. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: this https URL.
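To make the global/local factorization concrete, the sketch below (not the authors' implementation) animates a canonical point cloud by first applying a local deformation in the bounding-box frame, then carrying the box along a spline trajectory with a rigid transform. The Catmull-Rom spline parameterization, the yaw-only orientation aligned to the trajectory tangent, and the toy `local_deform` function (standing in for the learned deformation network) are illustrative assumptions.

```python
# Minimal sketch of trajectory-conditioned motion: global rigid transform
# along a spline, composed with a local deformation in the box frame.
# Spline type, orientation model, and the toy deformation are assumptions.
import numpy as np

def catmull_rom(ctrl: np.ndarray, t: float) -> np.ndarray:
    """Evaluate a Catmull-Rom spline through `ctrl` (K, 3) at t in [0, 1]."""
    k = len(ctrl) - 1
    seg = min(int(t * k), k - 1)       # segment index containing t
    u = t * k - seg                    # local parameter in [0, 1]
    p0 = ctrl[max(seg - 1, 0)]
    p1, p2 = ctrl[seg], ctrl[seg + 1]
    p3 = ctrl[min(seg + 2, k)]
    # Standard Catmull-Rom basis; interpolates p1 at u=0 and p2 at u=1.
    return 0.5 * ((2 * p1) + (-p0 + p2) * u
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u**2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * u**3)

def trajectory_rotation(ctrl: np.ndarray, t: float, eps: float = 1e-4) -> np.ndarray:
    """Yaw rotation aligning the box's forward (z) axis with the spline tangent."""
    tangent = catmull_rom(ctrl, min(t + eps, 1.0)) - catmull_rom(ctrl, max(t - eps, 0.0))
    yaw = np.arctan2(tangent[0], tangent[2])   # rotation about the up (y) axis
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def local_deform(x: np.ndarray, t: float) -> np.ndarray:
    """Placeholder for the learned deformation field (an MLP in practice)."""
    return 0.05 * np.sin(2 * np.pi * (t + x[..., 2:3]))  # toy periodic motion

def warp(x_canonical: np.ndarray, t: float, ctrl: np.ndarray) -> np.ndarray:
    """Canonical points -> world: local deformation, then global rigid motion."""
    x_def = x_canonical + local_deform(x_canonical, t)
    R, p = trajectory_rotation(ctrl, t), catmull_rom(ctrl, t)
    return x_def @ R.T + p

ctrl = np.array([[0., 0., 0.], [1., 0., 2.], [3., 0., 2.], [4., 0., 0.]])
pts = np.random.rand(8, 3) - 0.5            # points inside a unit bounding box
print(warp(pts, t=0.5, ctrl=ctrl).shape)    # (8, 3)
```

The composition order matters: the deformation acts in the box's local frame so it can be supervised by a video model independently of where the trajectory places the scene, which is how motion can extend beyond a fixed rendering volume.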
URL
https://arxiv.org/abs/2403.17920