Abstract
When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (\textit{e.g.}, speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but semantically shallow. To address this gap, we introduce \textbf{Intentional-Gesture}, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the \textbf{InG} dataset by augmenting BEAT-2 with gesture-intention annotations (\textit{i.e.}, text sentences summarizing intentions), generated automatically with large vision-language models. Next, we introduce the \textbf{Intentional Gesture Motion Tokenizer} to leverage these annotations: it injects high-level communicative functions (\textit{e.g.}, intentions) into tokenized motion representations, enabling intention-aware gesture synthesis that is both temporally aligned and semantically meaningful. Our method achieves new state-of-the-art performance on the BEAT-2 benchmark, and the framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: this https URL
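The abstract only names the tokenizer, so the following PyTorch code is a purely illustrative sketch of one plausible realization: a VQ-style motion tokenizer whose latents attend to a sentence-level embedding of the intention annotation via cross-attention before quantization. The class name, dimensions, and fusion scheme are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of intention-aware motion tokenization, written from
# the abstract alone; it is NOT the paper's actual architecture.
import torch
import torch.nn as nn

class IntentionAwareMotionTokenizer(nn.Module):  # hypothetical name
    def __init__(self, motion_dim=165, latent_dim=256, codebook_size=512, intent_dim=768):
        super().__init__()
        # Frame-wise motion encoder (assumed MLP for brevity).
        self.encoder = nn.Sequential(
            nn.Linear(motion_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Cross-attention: motion latents (queries) attend to the intention embedding.
        self.intent_proj = nn.Linear(intent_dim, latent_dim)
        self.fuse = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, motion, intent_emb):
        # motion: (B, T, motion_dim); intent_emb: (B, intent_dim), e.g. a
        # sentence embedding of the VLM-written intention annotation.
        z = self.encoder(motion)                              # (B, T, D)
        intent = self.intent_proj(intent_emb).unsqueeze(1)    # (B, 1, D)
        z, _ = self.fuse(query=z, key=intent, value=intent)   # intention-injected latents
        # Nearest-neighbor vector quantization (straight-through estimator omitted).
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, T, K)
        tokens = dists.argmin(dim=-1)                              # discrete motion tokens
        return tokens, self.codebook(tokens)

tok = IntentionAwareMotionTokenizer()
tokens, z_q = tok(torch.randn(2, 64, 165), torch.randn(2, 768))
print(tokens.shape, z_q.shape)  # torch.Size([2, 64]) torch.Size([2, 64, 256])
```

Under this assumed design, the discrete tokens carry intention information in addition to kinematics, so a downstream generator conditioned on speech can decode gestures that are semantically grounded rather than merely rhythm-aligned.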
URL
https://arxiv.org/abs/2505.15197