Abstract
Generating 3D-based body movements from speech shows great potential for a wide range of downstream applications, yet it still struggles to imitate realistic human motion. Predominant research efforts focus on end-to-end generation schemes for co-speech gestures, spanning GANs, VQ-VAEs, and recent diffusion models. In this paper, we argue that, for this ill-posed problem, these prevailing learning schemes fail to model the crucial inter- and intra-correlations across different motion units, i.e., head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation. Different from predominant research, our approach models this multi-modal implicit relationship through two explicit technical insights: i) to disentangle the complicated gesture movements, we first explore gesture motion phase manifolds with periodic autoencoders, imitating human motion regularities from realistic distributions while incorporating non-periodic components from the current latent states for instance-level diversity; ii) to model the hierarchical relationship among face motions, body gestures, and hand movements, we drive the animation with cascaded guidance during learning. We demonstrate our proposed approach on 3D avatars, and extensive experiments show that our method outperforms state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations. Code and models will be publicly available.
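As a concrete illustration of insight (i), the sketch below shows one way a periodic autoencoder can extract phase manifolds from motion curves: a convolutional encoder produces latent curves, an FFT parameterizes each curve by frequency, amplitude, and offset, and a learned 2D projection reads off the phase angle. This is a minimal sketch in the general spirit of periodic autoencoders, not the paper's exact architecture; the PeriodicAutoencoder class, all layer sizes, and the 61-frame window are illustrative assumptions.

import torch
import torch.nn as nn

class PeriodicAutoencoder(nn.Module):
    # Minimal sketch of a periodic autoencoder; all sizes are illustrative.
    def __init__(self, dof=72, window=61, num_phases=8):
        super().__init__()
        self.window, self.num_phases = window, num_phases
        # 1D conv encoder: a window of motion curves -> num_phases latent curves
        self.enc = nn.Sequential(
            nn.Conv1d(dof, 64, kernel_size=25, padding=12), nn.ELU(),
            nn.Conv1d(64, num_phases, kernel_size=25, padding=12))
        # learned 2D projection used to read off a phase angle per channel
        self.phase_head = nn.Conv1d(num_phases, 2 * num_phases, kernel_size=1)
        self.dec = nn.Sequential(
            nn.Conv1d(num_phases, 64, kernel_size=25, padding=12), nn.ELU(),
            nn.Conv1d(64, dof, kernel_size=25, padding=12))

    def forward(self, x):
        # x: (batch, dof, window) motion curves
        latent = self.enc(x)                                  # (B, P, T)
        spectrum = torch.fft.rfft(latent, dim=2)              # per-channel FFT
        power = spectrum.abs() ** 2
        # dominant non-DC frequency per latent channel (cycles per frame)
        freqs = torch.fft.rfftfreq(self.window, device=x.device)
        freq = freqs[power[:, :, 1:].argmax(dim=2) + 1]       # (B, P)
        # crude amplitude/offset estimates from the spectrum (illustrative)
        amp = 2.0 * power[:, :, 1:].sum(dim=2).sqrt() / self.window
        offset = spectrum[:, :, 0].real / self.window
        # phase angle from a 2D (sin, cos) projection at the center frame
        sc = self.phase_head(latent)[:, :, self.window // 2]  # (B, 2P)
        phase = torch.atan2(sc[:, :self.num_phases], sc[:, self.num_phases:])
        # re-synthesize each latent channel as a parameterized periodic signal
        t = torch.arange(self.window, dtype=x.dtype, device=x.device)
        signal = amp[..., None] * torch.sin(
            2 * torch.pi * freq[..., None] * t + phase[..., None]) + offset[..., None]
        return self.dec(signal), (phase, freq, amp, offset)

During training, a reconstruction loss between the decoder output and x would push the encoder toward periodic latent structure, and the resulting (phase, frequency, amplitude, offset) tuples would serve as the phase-manifold features.

Insight (ii) can be pictured as a cascade in which each motion unit is conditioned on the audio features together with the output of the previous stage, so face motions guide body gestures and body gestures guide hand movements. Again, the module below is a hypothetical sketch; the GRU backbones and all dimensions are assumptions, not the paper's design.

import torch
import torch.nn as nn

class CascadedGuidance(nn.Module):
    # Hypothetical face -> body -> hands cascade; all dimensions are assumptions.
    def __init__(self, audio_dim=128, face_dim=51, body_dim=63, hand_dim=90):
        super().__init__()
        self.face_net = nn.GRU(audio_dim, face_dim, batch_first=True)
        self.body_net = nn.GRU(audio_dim + face_dim, body_dim, batch_first=True)
        self.hand_net = nn.GRU(audio_dim + body_dim, hand_dim, batch_first=True)

    def forward(self, audio):
        # audio: (batch, frames, audio_dim) speech features
        face, _ = self.face_net(audio)                         # face guides body
        body, _ = self.body_net(torch.cat([audio, face], -1))  # body guides hands
        hand, _ = self.hand_net(torch.cat([audio, body], -1))
        return face, body, hand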
URL
https://arxiv.org/abs/2512.13131