Abstract
Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities such as language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gestures in complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision-language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at this http URL.
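To make the proposed VQA benchmarking concrete, the sketch below shows what an evaluation loop over such a task might look like. The `GestureVQASample` schema, the example question, and the `query_vlm` stub are illustrative assumptions, not SocialGesture's actual data format or API.

```python
from dataclasses import dataclass

# Hypothetical sample layout for a multi-person gesture VQA item.
# SocialGesture's real schema may differ; this is an illustrative sketch.
@dataclass
class GestureVQASample:
    video_path: str      # clip containing the social interaction
    question: str        # e.g. "Who is the pointing gesture directed at?"
    choices: list[str]   # candidate answers (multiple choice)
    answer: str          # ground-truth choice

def query_vlm(sample: GestureVQASample) -> str:
    """Stub for a vision-language model call.

    A real evaluation would pass sampled video frames plus the question
    to a VLM and parse its response into one of the candidate choices.
    Here we simply return the first choice as a placeholder baseline.
    """
    return sample.choices[0]

def evaluate(samples: list[GestureVQASample]) -> float:
    """Accuracy of the (stubbed) VLM over the VQA samples."""
    correct = sum(query_vlm(s) == s.answer for s in samples)
    return correct / len(samples) if samples else 0.0

if __name__ == "__main__":
    demo = [
        GestureVQASample(
            video_path="clip_0001.mp4",
            question="Who is the speaker pointing at?",
            choices=["person on the left", "person on the right"],
            answer="person on the right",
        )
    ]
    print(f"accuracy: {evaluate(demo):.2f}")
```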
URL
https://arxiv.org/abs/2504.02244