Intentional Gesture: Deliver Your Intentions with Gestures for Speech

2025-05-21 07:24:51
Pinxin Liu, Haiyang Liu, Luchuan Song, Chenliang Xu

Abstract

When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are generated automatically using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI.

URL

https://arxiv.org/abs/2505.15197

PDF

https://arxiv.org/pdf/2505.15197.pdf

