Abstract
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at this https URL.
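The abstract describes sampling from a conditional denoising diffusion model, where the denoiser is conditioned on both a text embedding and scene features. As a rough illustration only (the paper's actual architecture, schedule, and conditioning are not specified here), the sketch below shows a generic reverse-diffusion sampling loop for a motion sequence, assuming an MDM-style denoiser that predicts the clean motion x_0; the function names, tensor shapes, and cosine schedule are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def cosine_betas(T=50):
    # Cosine noise schedule (a common default in diffusion models);
    # the paper's actual schedule is not specified in the abstract.
    s = 0.008
    t = np.linspace(0, T, T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return np.clip(1 - f[1:] / f[:-1], 0, 0.999)

def sample_motion(denoiser, text_emb, scene_emb, n_frames=60, dim=66, T=50, seed=0):
    """Reverse-diffusion sampling of a motion sequence conditioned on text
    and scene embeddings. `denoiser(x_t, t, text_emb, scene_emb)` is a
    hypothetical network assumed to predict the clean motion x_0."""
    rng = np.random.default_rng(seed)
    betas = cosine_betas(T)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    # Start from pure Gaussian noise over all frames and pose dimensions.
    x = rng.standard_normal((n_frames, dim))
    for t in range(T - 1, -1, -1):
        x0_hat = denoiser(x, t, text_emb, scene_emb)
        ab = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        # Mean of the posterior q(x_{t-1} | x_t, x0_hat).
        coef0 = np.sqrt(ab_prev) * betas[t] / (1 - ab)
        coeft = np.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab)
        mean = coef0 * x0_hat + coeft * x
        if t > 0:
            var = betas[t] * (1 - ab_prev) / (1 - ab)
            x = mean + np.sqrt(var) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x
```

In a scene-aware fine-tuning stage like the one the abstract describes, `scene_emb` would encode ground-plane and object-shape information, while the pre-trained text-conditioned weights are reused; that split is what allows training on large scene-free mocap data first.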
URL
https://arxiv.org/abs/2404.10685