Abstract
Controlling human gestures based on speech signals presents a significant challenge in computer vision. While existing works have made preliminary studies of generating holistic co-speech gestures from speech, the spatial interaction of each body region during speech remains barely explored. This leads to unnatural body-part interactions given the speech signal. Furthermore, the slow generation speed limits the construction of real-world digital avatars. To resolve these problems, we propose \textbf{GestureLSM}, a Latent Shortcut based approach for Co-Speech Gesture Generation with spatial-temporal modeling. We tokenize various body regions and explicitly model their interactions with spatial and temporal attention. To achieve real-time gesture generation, we examine the denoising patterns and design an effective time distribution that speeds up sampling while improving the generation quality of the shortcut model. Extensive quantitative and qualitative experiments demonstrate the effectiveness of GestureLSM, showcasing its potential for various applications in the development of digital humans and embodied agents. Project Page: this https URL
URL
https://arxiv.org/abs/2501.18898