Abstract
Vision-based perception for autonomous driving requires explicit modeling of a 3D space, into which 2D latent representations are mapped and on which subsequent 3D operators are applied. However, operating on dense latent spaces introduces cubic time and space complexity, which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using projections such as Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections result in information loss, especially for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. First, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Second, a feature pyramid and sparse interpolation enhance each scale with information from the others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction in FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIoU, which can partly be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
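To illustrate the idea behind the abstract's "spatially decomposed 3D sparse convolutional kernels", the sketch below shows a toy sparse diffusion pass: a k×k×k kernel is factored into three 1D passes (along x, y, then z), so each occupied voxel does O(3k) work instead of O(k³), and features spread into neighboring empty voxels (latent completion). This is a minimal illustration, not the paper's implementation; the kernel weights, data layout (a dict of coordinates to features), and function names are all assumptions for clarity.

```python
import numpy as np

def sparse_diffuse_1d(voxels, axis, kernel=(0.25, 0.5, 0.25)):
    """One 1D pass of a spatially decomposed sparse diffusion.

    `voxels` maps (x, y, z) -> feature vector. The pass spreads each
    occupied voxel's feature to its neighbors along `axis`, growing the
    occupied set -- the "latent completion" role of the sparse diffuser.
    """
    out = {}
    offsets = (-1, 0, 1)
    for coord, feat in voxels.items():
        for off, weight in zip(offsets, kernel):
            nb = list(coord)
            nb[axis] += off
            nb = tuple(nb)
            out[nb] = out.get(nb, np.zeros_like(feat)) + weight * feat
    return out

def sparse_diffuser_3d(voxels):
    """A 3x3x3 kernel decomposed into three 1D passes (x, then y, then z)."""
    for axis in range(3):
        voxels = sparse_diffuse_1d(voxels, axis)
    return voxels
```

Starting from a single occupied voxel, three decomposed passes populate its full 3×3×3 neighborhood, which is how the sparse latent representation is completed without ever materializing the dense grid.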
URL
https://arxiv.org/abs/2404.09502