Abstract
Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.
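The composition of several diffusion models described above can be illustrated with a minimal sketch. This is not EnergyMoGen's implementation: it assumes the conjunction rule commonly used in compositional diffusion work, in which each concept contributes a weighted offset from the unconditional noise prediction. The function name `compose_conjunction`, the weights, and the toy arrays standing in for latent noise predictions are all hypothetical.

```python
import numpy as np


def compose_conjunction(eps_uncond, eps_conds, weights):
    """Combine per-concept noise predictions into one composed prediction.

    Assumed rule (not taken from the paper):
        eps = eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond)

    eps_uncond: unconditional noise prediction for the current latent
    eps_conds:  list of noise predictions, one per text concept
    weights:    list of per-concept guidance weights (hypothetical values)
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    composed = eps_uncond.copy()
    for eps_c, w in zip(eps_conds, weights):
        # Each concept adds its own energy-gradient term relative to the
        # unconditional prediction.
        composed += w * (np.asarray(eps_c, dtype=float) - eps_uncond)
    return composed


# Toy stand-ins for noise predictions over a 4-dimensional latent.
eps_u = np.zeros(4)
eps_walk = np.ones(4)        # e.g. concept "walking"
eps_wave = 2.0 * np.ones(4)  # e.g. concept "waving"
eps = compose_conjunction(eps_u, [eps_walk, eps_wave], [1.0, 0.5])
```

With a single concept, this rule reduces to ordinary classifier-free guidance, which is why it is a convenient starting point for thinking about multi-concept composition in latent space.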
URL
https://arxiv.org/abs/2412.14706