Abstract
Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasizing the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rates between modalities. To handle the temporal misalignment and sampling-rate differences, we investigate extended speech time windows and employ a separate backbone model for each modality. We use Transformer encoders in cross-modal and early fusion settings to align and integrate the speech and skeletal sequences effectively. The results show that combining visual and speech information significantly improves gesture detection: expanding the speech buffer beyond the visual time segments increases performance, and multimodal integration via cross-modal and early fusion outperforms unimodal and late-fusion baselines. Additionally, we find a correlation between the models' gesture-prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study offers a better understanding of co-speech gestures and improved methods for detecting them, facilitating the analysis of multimodal communication.
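The pipeline sketched in the abstract, separate backbones per modality followed by early fusion over a joint Transformer encoder, can be illustrated in miniature. This is a hedged sketch, not the paper's implementation: the sampling rates (speech at 100 Hz, skeleton at 25 Hz), feature dimensions, the single untrained attention layer standing in for a full Transformer encoder, and the linear "backbones" are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Projection weights are omitted for brevity; a real Transformer
    encoder layer would add learned Q/K/V projections, a feed-forward
    sublayer, residual connections, and layer normalization.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

# Hypothetical 2 s window: speech frames at 100 Hz, skeleton frames at 25 Hz,
# so the two modalities have different sequence lengths.
speech = rng.standard_normal((200, 13))    # e.g. 13 MFCCs per speech frame
skeleton = rng.standard_normal((50, 54))   # e.g. 27 joints x (x, y) coordinates

# Separate per-modality "backbones" (here plain linear maps) project each
# modality from its native feature space to a shared model dimension,
# sidestepping the sampling-rate mismatch.
d_model = 32
speech_tok = speech @ (rng.standard_normal((13, d_model)) * 0.1)
skel_tok = skeleton @ (rng.standard_normal((54, d_model)) * 0.1)

# Early fusion: concatenate the two token sequences along the time axis,
# tagging each with a modality embedding (a fixed one-hot offset here;
# learned in practice) so the encoder can tell the streams apart.
mod_speech = np.zeros(d_model); mod_speech[0] = 1.0
mod_skel = np.zeros(d_model); mod_skel[1] = 1.0
fused = np.concatenate([speech_tok + mod_speech, skel_tok + mod_skel], axis=0)

# The joint encoder attends across both modalities at once, letting visual
# tokens condition on speech context (and vice versa).
encoded = self_attention(fused)

# Per-skeleton-frame gesture probability, read off the visual positions.
w_out = rng.standard_normal(d_model) * 0.1
probs = 1 / (1 + np.exp(-(encoded[200:] @ w_out)))
print(fused.shape, probs.shape)  # (250, 32) (50,)
```

In a cross-modal fusion variant, the concatenation step would instead be replaced by cross-attention, with one modality's tokens forming the queries and the other's the keys and values.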
URL
https://arxiv.org/abs/2404.14952