Abstract
In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure giving instructions or of a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the zero-shot interpretation capabilities of state-of-the-art vision-language models (VLMs), focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets, "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)", covering a range of formal and informal TGs such as 'Stop', 'Reverse', and 'Hail'. Both are annotated with natural-language descriptions of the pedestrian's body position and gesture. We evaluate the models with three methods that use expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings indicate that although some SOTA VLMs can interpret human traffic gestures zero-shot, none are accurate and robust enough to be trustworthy, underscoring the need for further research in this domain.
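To make evaluation method (1) concrete, the following is a minimal sketch of how caption similarity between a VLM-generated caption and an expert-written caption could be scored with a sentence-embedding model. The choice of encoder (all-MiniLM-L6-v2) and the example captions are assumptions for illustration; the abstract does not specify which sentence encoder or prompts the authors used.

```python
# Sketch of evaluation method (1): cosine similarity between a VLM caption
# and an expert reference caption in sentence-embedding space.
# NOTE: the embedding model and captions below are illustrative assumptions,
# not the paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

expert_caption = "The officer raises the right arm, palm facing the driver, signaling 'Stop'."
vlm_caption = "A person stands in the road holding one arm up toward the camera."

# Encode both captions and compute their cosine similarity.
embeddings = model.encode([expert_caption, vlm_caption], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Caption similarity: {similarity:.2f}")
```

Averaging such scores over a dataset against expert captions gives the sentence-similarity figure the study reports (below 0.59 for current VLMs).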
URL
https://arxiv.org/abs/2504.10873