Abstract
3D scene understanding plays a vital role in vision-based autonomous driving. While most existing methods focus on 3D object detection, they have difficulty describing real-world objects of arbitrary shapes and unbounded classes. Towards a more comprehensive perception of 3D scenes, in this paper we propose SurroundOcc, a method that predicts 3D occupancy from multi-camera images. We first extract multi-scale features for each image and adopt spatial 2D-3D attention to lift them to the 3D volume space. Then we apply 3D convolutions to progressively upsample the volume features and impose supervision at multiple levels. To obtain dense occupancy prediction, we design a pipeline that generates dense occupancy ground truth without expensive occupancy annotations. Specifically, we fuse multi-frame LiDAR scans of dynamic objects and static scenes separately. Then we adopt Poisson Reconstruction to fill the holes and voxelize the mesh to get dense occupancy labels. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate the superiority of our method. Code and dataset are available at this https URL
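To make the architecture concrete, here is a minimal PyTorch sketch of the lifting-and-decoding idea described above: learnable 3D volume queries cross-attend to flattened multi-camera image features, and transposed 3D convolutions upsample the volume coarse-to-fine with an occupancy head at every level. All names (`VolumeLifter`, `vol`, `levels`) are illustrative, and plain dot-product cross-attention stands in for the paper's spatial 2D-3D attention; this is a sketch of the concept, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VolumeLifter(nn.Module):
    """Toy 2D-to-3D lifting: 3D volume queries cross-attend to multi-view
    image features, then transposed 3D convolutions upsample the volume
    coarse-to-fine. Illustrative only; SurroundOcc uses spatial 2D-3D
    attention rather than this simplified dot-product attention."""

    def __init__(self, dim=64, vol=(25, 25, 2), levels=2):
        super().__init__()
        self.vol = vol  # coarse (X, Y, Z) resolution of the volume queries
        self.queries = nn.Parameter(torch.randn(vol[0] * vol[1] * vol[2], dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # each transposed 3D conv doubles the volume resolution
        self.ups = nn.ModuleList(
            nn.ConvTranspose3d(dim, dim, kernel_size=2, stride=2)
            for _ in range(levels))
        # one occupancy head per level enables multi-level supervision
        self.heads = nn.ModuleList(nn.Conv3d(dim, 1, 1) for _ in range(levels + 1))

    def forward(self, img_feats):
        # img_feats: (B, N_cams * H * W, dim) flattened multi-camera features
        B = img_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        v, _ = self.attn(q, img_feats, img_feats)      # lift 2D features into 3D queries
        X, Y, Z = self.vol
        v = v.transpose(1, 2).reshape(B, -1, X, Y, Z)  # (B, dim, X, Y, Z) volume
        outs = [self.heads[0](v)]                      # coarsest occupancy logits
        for up, head in zip(self.ups, self.heads[1:]):
            v = up(v)                                  # progressively upsample
            outs.append(head(v))                       # supervise every level
        return outs  # occupancy logits, coarse to fine
```

For example, `VolumeLifter()(torch.randn(1, 6 * 30 * 40, 64))` (six cameras with 30x40 feature maps) returns logits at 25x25x2, 50x50x4, and 100x100x8 resolution, matching the coarse-to-fine supervision scheme the abstract describes.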
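The ground-truth pipeline (fuse multi-frame LiDAR, Poisson-reconstruct a watertight mesh to fill holes, voxelize the mesh) can similarly be sketched with Open3D's stock primitives. This assumes dynamic objects and the static scene were already fused separately and merged into a single point array; `fused_points` and `voxel_size` are placeholders, and semantic labeling is omitted.

```python
import numpy as np
import open3d as o3d

def densify_occupancy(fused_points: np.ndarray, voxel_size: float = 0.5):
    """Sketch of the dense-label step: Poisson reconstruction fills holes in
    fused multi-frame LiDAR points, and the resulting mesh is voxelized into
    dense occupancy. Assumes `fused_points` is an (N, 3) array of already
    fused scans; semantics and per-class handling are not covered here."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(fused_points)
    # Poisson reconstruction requires normals; ideally they should also be
    # consistently oriented (e.g., toward the sensor positions)
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=1.0, max_nn=30))
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=9)
    # drop low-density vertices, which are poorly supported by the input points
    d = np.asarray(densities)
    mesh.remove_vertices_by_mask(d < np.quantile(d, 0.05))
    # voxelize the hole-filled mesh into dense occupancy labels
    voxels = o3d.geometry.VoxelGrid.create_from_triangle_mesh(mesh, voxel_size)
    return np.array([v.grid_index for v in voxels.get_voxels()])  # occupied indices
```

The key design point the abstract highlights is that Poisson reconstruction converts sparse, hole-ridden LiDAR accumulations into a closed surface, so voxelizing the mesh yields much denser occupancy labels than voxelizing the raw points directly.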
URL
https://arxiv.org/abs/2303.09551