Paper Reading AI Learner

Generating Human Interaction Motions in Scenes with Text Control

2024-04-16 16:04:38
Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, Davis Rempe

Abstract

We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at this https URL.
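The abstract describes generation with a denoising diffusion model conditioned on text and scene information. The paper's actual model and training details are not given here, so the following is only a minimal numpy sketch of the generic DDPM-style reverse sampling loop such methods build on; the noise schedule constants and the `toy_denoiser` stand-in for the learned, text/scene-conditioned noise predictor are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative linear beta schedule (values assumed, not from the paper).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t, cond):
    # Hypothetical stand-in for the learned noise predictor
    # eps_theta(x_t, t, condition), where `cond` would encode the
    # text prompt and scene features; here it simply nudges the
    # sample toward the conditioning vector.
    return (x_t - cond) * 0.1

def ddpm_sample(cond, dim=4, seed=0):
    """Run the standard DDPM reverse process from Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)          # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = toy_denoiser(x, t, cond)
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                          # add noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x

sample = ddpm_sample(np.ones(4))
```

In the two-stage scheme the abstract outlines, the same sampling loop would first be driven by a scene-agnostic denoiser pre-trained on motion-capture data, then by a fine-tuned denoiser whose conditioning additionally includes ground-plane and object-shape features.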

URL

https://arxiv.org/abs/2404.10685

PDF

https://arxiv.org/pdf/2404.10685.pdf

