Abstract
Most existing traffic video datasets, including Waymo, focus on structured and predominantly Western traffic, which hinders their global applicability. Many Asian traffic scenarios are far more complex, involving numerous objects with distinct motions and behaviors. To address this gap, we present DAVE (Diverse Actors in Varied Environments), a new dataset designed for evaluating perception methods in complex and unpredictable environments, with a high representation of Vulnerable Road Users (VRUs: e.g., pedestrians, animals, motorbikes, and bicycles). DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (including complex and rare cases such as cut-ins, zigzag movement, and U-turns) that require high reasoning ability. DAVE densely annotates over 13 million bounding boxes (bboxes) of actors with identity labels, of which more than 1.6 million boxes additionally carry action/behavior details. The videos in DAVE are collected across a broad spectrum of factors, such as weather conditions, time of day, road scenarios, and traffic density. DAVE can benchmark video tasks such as Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment Retrieval, and Multi-label Video Action Recognition. Given the critical importance of accurately identifying VRUs to prevent accidents and ensure road safety, vulnerable road users constitute 41.13% of instances in DAVE, compared to 23.71% in Waymo. DAVE thus provides an invaluable resource for developing more sensitive and accurate visual perception algorithms for the complex real world. Our experiments show that existing methods suffer performance degradation when evaluated on DAVE, highlighting its value for future video recognition research.
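As a rough illustration of the VRU statistic quoted above, the sketch below shows how such a share could be computed from per-category bounding-box counts. The category names and counts are hypothetical placeholders, not DAVE's actual statistics or data-loading API, which the abstract does not specify.

from collections import Counter

# Hypothetical per-category bounding-box counts (placeholders, not DAVE's real numbers).
instance_counts = Counter({
    "car": 5_000_000,
    "truck": 1_000_000,
    "bus": 400_000,
    "pedestrian": 2_000_000,
    "motorbike": 1_800_000,
    "bicycle": 500_000,
    "animal": 200_000,
})

# Categories the abstract lists as Vulnerable Road Users.
VRU_CATEGORIES = {"pedestrian", "animal", "motorbike", "bicycle"}

total = sum(instance_counts.values())
vru = sum(count for cat, count in instance_counts.items() if cat in VRU_CATEGORIES)
print(f"VRU share: {100.0 * vru / total:.2f}% of {total:,} annotated instances")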
URL
https://arxiv.org/abs/2412.20042