Paper Reading AI Learner

GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

2025-01-31 05:34:59
Pinxin Liu, Luchuan Song, Junhua Huang, Chenliang Xu

Abstract

Controlling human gestures based on speech signals presents a significant challenge in computer vision. While existing works have made preliminary progress in generating holistic co-speech gestures from speech, the spatial interaction of body regions during speech remains barely explored. This leads to unnatural body-part interactions given the speech signal. Furthermore, slow generation speed limits the construction of real-world digital avatars. To resolve these problems, we propose GestureLSM, a Latent Shortcut based approach for Co-Speech Gesture Generation with spatial-temporal modeling. We tokenize various body regions and explicitly model their interactions with spatial and temporal attention. To achieve real-time gesture generation, we examine the denoising patterns and design an effective time distribution that speeds up sampling while improving the generation quality of the shortcut model. Extensive quantitative and qualitative experiments demonstrate the effectiveness of GestureLSM, showcasing its potential for various applications in the development of digital humans and embodied agents. Project Page: this https URL

Abstract (translated)

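Controlling human poses based on speech signals is a major challenge in computer vision. Although existing research has made preliminary explorations of generating holistic co-speech gestures from speech, the spatial interaction of body parts during speech has received little study, which makes the interactions between limbs hard to handle given a speech signal. In addition, slow generation speed limits the construction of real-world digital avatars. To address these problems, we propose GestureLSM, a latent-shortcut-based approach to co-speech gesture generation that incorporates spatial and temporal modeling. We tokenize the body regions and explicitly model their interactions through spatial and temporal attention. To achieve real-time gesture generation, we study the denoising patterns and design an effective time distribution strategy that speeds up sampling while improving the generation quality of the shortcut model. Extensive quantitative and qualitative experiments demonstrate the effectiveness of GestureLSM and show its potential for a range of applications in the development of digital humans and embodied agents. Project page: this https URL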

URL

https://arxiv.org/abs/2501.18898

PDF

https://arxiv.org/pdf/2501.18898.pdf

