Paper Reading AI Learner

Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning

2025-12-15 09:43:08
Xin Guo, Yifan Zhao, Jia Li

Abstract

Generating 3D body movements from speech shows great potential in extensive downstream applications, yet it still faces challenges in imitating realistic human movements. Predominant research focuses on end-to-end schemes for generating co-speech gestures, spanning GANs, VQ-VAEs, and recent diffusion models. Since this is an ill-posed problem, we argue in this paper that these prevailing learning schemes fail to model the crucial inter- and intra-correlations across different motion units, i.e., head, body, and hands, thus leading to unnatural movements and poor coordination. To delve into these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-driven 3D gesture generation. Unlike predominant research, our approach models this multi-modal implicit relationship through two explicit technical insights: i) to disentangle the complicated gesture movements, we first explore gesture motion phase manifolds with periodic autoencoders, imitating natural human motion from realistic distributions while incorporating non-periodic components from the current latent states for instance-level diversity; ii) to model the hierarchical relationship among face motions, body gestures, and hand movements, we drive the animation with cascaded guidance during learning. We demonstrate our approach on 3D avatars, and extensive experiments show that our method outperforms state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations. Code and models will be made publicly available.
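The abstract's first insight builds on phase manifolds learned by periodic autoencoders: a motion channel is summarized by a small set of periodic parameters (amplitude, frequency, phase, offset) from which the movement can be decoded. The sketch below is only a toy illustration of that idea using a hand-rolled DFT on one motion channel; the function names and the DFT-based fitting are assumptions for illustration, not the paper's actual model, which learns these parameters with a neural autoencoder over full-body motion.

```python
import cmath
import math

def fit_periodic_params(signal, dt):
    """Toy 'encoder': summarize one motion channel by the amplitude,
    frequency, phase, and offset of its dominant periodic component,
    found by scanning DFT bins. Illustrative sketch only."""
    n = len(signal)
    offset = sum(signal) / n
    centered = [s - offset for s in signal]
    best_k, best_mag, best_coef = 1, 0.0, 0j
    for k in range(1, n // 2 + 1):  # positive-frequency DFT bins
        coef = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                   for i, c in enumerate(centered))
        if abs(coef) > best_mag:
            best_k, best_mag, best_coef = k, abs(coef), coef
    amp = 2.0 * best_mag / n        # rescale bin magnitude to amplitude
    freq = best_k / (n * dt)        # bin index -> Hz
    phase = cmath.phase(best_coef)
    return amp, freq, phase, offset

def reconstruct(t, amp, freq, phase, offset):
    """Toy 'decoder': map the periodic parameters back to a motion
    value at time t."""
    return amp * math.cos(2 * math.pi * freq * t + phase) + offset
```

The non-periodic residual (the original signal minus the reconstruction) is what an instance-level latent state would have to capture, which is the role the abstract assigns to the non-periodic components.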


URL

https://arxiv.org/abs/2512.13131

PDF

https://arxiv.org/pdf/2512.13131.pdf

