Abstract
Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures. The project website with video, code, and data can be found at this http URL.
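To make the task setup concrete, below is a minimal sketch (in PyTorch) of the kind of cross-modal regression the abstract describes: an audio feature sequence is translated to a 2D keypoint sequence, supervised by noisy pseudo ground truth from an off-the-shelf pose detector. This is not the authors' released model; the architecture, feature dimensions, keypoint count, and loss choice here are illustrative assumptions.

# Hypothetical sketch of speech-to-gesture translation (assumptions, not the paper's model).
import torch
import torch.nn as nn

N_AUDIO_FEATS = 64   # e.g. log-mel bins per audio frame (assumption)
N_KEYPOINTS = 49     # hand + arm keypoints, (x, y) each (assumption)

class Speech2Gesture(nn.Module):
    """Temporal 1D-conv network mapping audio features to keypoint coordinates."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_AUDIO_FEATS, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 2 * N_KEYPOINTS, kernel_size=5, padding=2),
        )

    def forward(self, audio):
        # audio: (batch, N_AUDIO_FEATS, time) -> (batch, 2 * N_KEYPOINTS, time)
        return self.net(audio)

model = Speech2Gesture()
audio = torch.randn(8, N_AUDIO_FEATS, 128)        # a batch of audio feature clips
pseudo_gt = torch.randn(8, 2 * N_KEYPOINTS, 128)  # pose-detector output used as labels
pred = model(audio)
# L1 regression is a common choice when labels are noisy detector outputs,
# since it is less sensitive than L2 to occasional bad keypoints.
loss = nn.functional.l1_loss(pred, pseudo_gt)
loss.backward()

In practice, training on detector output rather than hand-annotated poses is what lets the approach scale to large amounts of unlabeled monologue video.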
URL
https://arxiv.org/abs/1906.04160