Paper Reading AI Learner

Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles

2025-04-15 05:04:25
Tonko E. W. Bossen, Andreas Møgelmose, Ross Greer

Abstract

In autonomous driving, it is crucial to correctly interpret traffic gestures (TGs), such as those of an authority figure giving orders or instructions, or a pedestrian signaling the driver, to ensure a safe and pleasant traffic environment for all road users. This study investigates the capabilities of state-of-the-art vision-language models (VLMs) in zero-shot interpretation, focusing on their ability to caption and classify human gestures in traffic contexts. We create and publicly share two custom datasets covering a variety of formal and informal TGs, such as 'Stop', 'Reverse', and 'Hail'. The datasets are "Acted TG (ATG)" and "Instructive TG In-The-Wild (ITGI)", and they are annotated with natural language describing the pedestrian's body position and gesture. We evaluate models with three methods, using expert-generated captions as baseline and control: (1) caption similarity, (2) gesture classification, and (3) pose sequence reconstruction similarity. Results show that current VLMs struggle with gesture understanding: sentence similarity averages below 0.59, and classification F1 scores reach only 0.14-0.39, well below the expert baseline of 0.70. While pose reconstruction shows potential, it requires more data and refined metrics to be reliable. Our findings reveal that although some SOTA VLMs can interpret human traffic gestures in a zero-shot setting, none are accurate and robust enough to be trustworthy, emphasizing the need for further research in this domain.
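
To make the evaluation concrete, here is a minimal sketch of how the three methods above could be computed, assuming sentence-transformers and scikit-learn are available. The embedding model ("all-MiniLM-L6-v2"), the example captions and labels, and the per-joint error metric are illustrative assumptions, not the authors' released code or exact metrics.

```python
# Sketch of the three evaluation methods described in the abstract.
# Model choice, example data, and the pose metric are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import f1_score

# (1) Caption similarity: cosine similarity between a VLM-generated caption
# and an expert-written caption of the same gesture clip.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
expert_caption = "The pedestrian raises the right arm, palm forward, signaling the driver to stop."
vlm_caption = "A person stands in the road holding up one hand."
emb = encoder.encode([expert_caption, vlm_caption], convert_to_tensor=True)
print(f"caption similarity: {util.cos_sim(emb[0], emb[1]).item():.2f}")

# (2) Gesture classification: macro-averaged F1 between predicted and
# ground-truth gesture labels.
true_labels = ["Stop", "Hail", "Reverse", "Stop"]
pred_labels = ["Stop", "Stop", "Reverse", "Hail"]
print("classification F1:", f1_score(true_labels, pred_labels, average="macro"))

# (3) Pose sequence reconstruction similarity (one simple choice): mean
# per-joint Euclidean distance between a reconstructed pose sequence and the
# ground truth; the paper notes this evaluation still needs refined metrics.
gt_poses = np.random.rand(30, 17, 2)  # 30 frames, 17 joints, (x, y) -- dummy data
recon_poses = gt_poses + np.random.normal(scale=0.05, size=gt_poses.shape)
print(f"mean per-joint error: {np.linalg.norm(gt_poses - recon_poses, axis=-1).mean():.3f}")
```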

URL

https://arxiv.org/abs/2504.10873

PDF

https://arxiv.org/pdf/2504.10873.pdf

