Abstract
Ensuring traffic safety and preventing accidents are critical goals in daily driving, and advances in computer vision can be leveraged toward achieving them. In this paper, we present M$^2$DAR, a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos, with a particular focus on detecting distracted driving behaviors. Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations. Furthermore, we propose a novel election algorithm, consisting of aggregation, filtering, merging, and selection stages, that refines the preliminary results of the action recognition module across multiple views. Extensive experiments on the 7th AI City Challenge Track 3 dataset demonstrate the effectiveness of our approach: we achieved an overlap score of 0.5921 on the A2 test set. Our source code is available at \url{this https URL}.
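The four-stage election process named in the abstract (aggregation, filtering, merging, selection) can be sketched as follows. This is a minimal, hypothetical illustration of how cross-view candidates might be refined; the segment layout, thresholds, and tie-breaking rules are assumptions for illustration, not the paper's actual implementation.

```python
def elect(candidates, score_thresh=0.5):
    """Refine per-view action candidates into one segment per action class.

    candidates: list of (view, cls, start, end, score) tuples, one per
    preliminary detection from each camera view. All details here are
    illustrative assumptions, not the authors' exact algorithm.
    """
    # 1) Aggregation: pool candidate segments from all views by action class.
    by_cls = {}
    for view, cls, start, end, score in candidates:
        by_cls.setdefault(cls, []).append((start, end, score))

    results = []
    for cls, segs in by_cls.items():
        # 2) Filtering: drop low-confidence segments.
        segs = sorted(s for s in segs if s[2] >= score_thresh)
        # 3) Merging: fuse temporally overlapping segments of the same class.
        merged = []
        for start, end, score in segs:
            if merged and start <= merged[-1][1]:
                ps, pe, pscore = merged[-1]
                merged[-1] = (ps, max(pe, end), max(pscore, score))
            else:
                merged.append((start, end, score))
        # 4) Selection: keep the highest-scoring merged segment per class.
        if merged:
            best = max(merged, key=lambda s: s[2])
            results.append((cls,) + best)
    return results
```

For example, two overlapping detections of the same class from the dashboard and rear-view cameras would be merged into a single segment spanning both, while an isolated low-confidence detection would be filtered out.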
URL
https://arxiv.org/abs/2305.08877