Abstract
Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to produce beat-aligned motions, struggle to generate gestures that are semantically aligned with the utterance. Compared to beat gestures, which align naturally with the audio signal, semantically coherent gestures require modeling the complex interactions between language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs, but also facilitates controllability in gesture synthesis. Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities (e.g. audio vs. text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures. To further advance research on multi-party interactive gestures, we release the DnD Group Gesture dataset, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate its effectiveness on a variety of tasks. We urge the reader to watch the supplementary video on our website.
URL
https://arxiv.org/abs/2403.17936