Paper Reading AI Learner

ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis

2024-03-26 17:59:52
Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, Christian Theobalt

Abstract

Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle to generate gestures that are semantically aligned with the utterance. Compared to beat gestures, which align naturally with the audio signal, semantically coherent gestures require modeling the complex interactions between language and human motion, and can be controlled by focusing on certain words. We therefore present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis that not only generates gestures from multi-modal speech inputs but also facilitates controllability in gesture synthesis. Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities (e.g., audio vs. text) and to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures. To further advance research on multi-party interactive gestures, we release the DnD Group Gesture dataset, which contains 6 hours of gesture data of 5 people interacting with one another. We compare our method with several recent works and demonstrate the effectiveness of our method on a variety of tasks. We urge the reader to watch the supplementary video on our website.
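The abstract describes per-modality guidance, i.e., separately scaling how strongly audio and text condition the diffusion sampler. The paper's own formulation is not reproduced on this page, so the following is only a minimal illustrative sketch of how such modality-wise weighting could look in a classifier-free-guidance-style denoising step; the function and parameter names (guided_denoise, w_audio, w_text, and the zero-tensor stand-in for learned null embeddings) are assumptions for illustration, not the authors' API.

```python
import torch

def guided_denoise(model, x_t, t, audio_emb, text_emb,
                   w_audio=1.5, w_text=2.5):
    """Hypothetical per-modality classifier-free guidance step.

    Combines an unconditional noise prediction with audio-only and
    text-only conditional predictions, so each modality's influence
    on the generated gesture can be scaled independently.
    """
    null = torch.zeros_like  # stand-in for learned null embeddings

    eps_uncond = model(x_t, t, null(audio_emb), null(text_emb))
    eps_audio = model(x_t, t, audio_emb, null(text_emb))
    eps_text = model(x_t, t, null(audio_emb), text_emb)

    # Each guidance term pushes the sample toward its modality's
    # conditional distribution, weighted by its own scale.
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))
```

Under this sketch, raising w_text relative to w_audio would favor semantically driven gestures over beat alignment, which mirrors the audio-vs-text modulation the abstract describes.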

URL

https://arxiv.org/abs/2403.17936

PDF

https://arxiv.org/pdf/2403.17936.pdf
