Abstract
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at this https URL
URL
https://arxiv.org/abs/2507.22412