Paper Reading AI Learner

EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space

2024-12-19 10:19:43
Jianrong Zhang, Hehe Fan, Yi Yang

Abstract

Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.
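The abstract does not spell out the composition mechanism, but the standard energy-based view it builds on is that a conjunction of concepts corresponds to a product of densities, p(z) ∝ Π_i p_i(z)^{w_i}, whose score is the weighted sum of the per-concept scores. The sketch below is a minimal, hypothetical illustration of that idea (not the paper's actual latent-space model): two toy "concepts" are unit Gaussians, their scores are summed, and unadjusted Langevin dynamics samples from the composed distribution.

```python
import numpy as np

def composed_score(z, score_fns, weights):
    # Energy-based conjunction: p(z) ∝ prod_i p_i(z)^{w_i},
    # so grad log p(z) = sum_i w_i * grad log p_i(z).
    out = np.zeros_like(z, dtype=float)
    for w, s in zip(weights, score_fns):
        out += w * s(z)
    return out

def langevin_sample(score_fn, z0, step=0.05, n_steps=1000, seed=0):
    # Unadjusted Langevin dynamics:
    # z <- z + step * score(z) + sqrt(2 * step) * noise.
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for _ in range(n_steps):
        z = z + step * score_fn(z) + np.sqrt(2 * step) * rng.standard_normal(z.shape)
    return z

# Toy demo: two "concepts" as unit Gaussians with means 0 and 2.
# Their conjunction is the product density N(1, 1/2), so samples driven
# by the summed scores should concentrate near z = 1.
s1 = lambda z: (0.0 - z)  # score of N(0, I)
s2 = lambda z: (2.0 - z)  # score of N(2, I)
fn = lambda z: composed_score(z, [s1, s2], [1.0, 1.0])
samples = np.array([langevin_sample(fn, np.zeros(1), seed=k)[0] for k in range(100)])
print(samples.mean())  # concentrates near 1.0
```

In the paper's setting the scores come from concept-conditioned denoisers in the motion latent space rather than analytic Gaussians, and the Synergistic Energy Fusion step decides how the latent-aware and semantic-aware energy terms are weighted; the sampling principle, however, is the same weighted-score composition shown here.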

Abstract (translated)

Diffusion models, particularly latent diffusion models, have shown remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively integrate multiple semantic concepts into a single coherent motion sequence. To address this, we propose EnergyMoGen, which comprises two spectrums of Energy-Based Models: (1) we interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) we introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent on text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we propose Synergistic Energy Fusion. This design allows the latent motion diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our method outperforms existing state-of-the-art models on a variety of motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. In addition, we demonstrate that our method can be used to extend motion datasets and improve performance on the text-to-motion task.

URL

https://arxiv.org/abs/2412.14706

PDF

https://arxiv.org/pdf/2412.14706.pdf

