MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy Prediction

Abstract

Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. Vision-centric methods suffer from geometric inaccuracies, while LiDAR-based approaches often lack rich semantic information. To address these limitations, MS-Occ is proposed: a novel multi-stage LiDAR-camera fusion framework that combines middle-stage and late-stage fusion to integrate LiDAR's geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on the nuScenes-OpenOccupancy benchmark show that MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the previous state of the art by 0.7% IoU and 2.4% mIoU. Ablation studies further validate the contribution of each module and show substantial improvements in small-object perception, demonstrating the practical value of MS-Occ for safety-critical autonomous driving scenarios.
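The abstract reports both IoU and mIoU, so a small sketch of how the two metrics differ may help. This is a minimal illustration, not the benchmark's evaluation code. It assumes that label 0 marks empty space, that IoU is computed on the binary occupied/empty split, and that mIoU averages per-class IoU over the semantic classes. The function names `occupancy_iou` and `semantic_miou` and the flattened-list voxel layout are hypothetical.

```python
from typing import Sequence

EMPTY = 0  # assumption: label 0 means free space; labels 1..num_classes-1 are semantic classes


def occupancy_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Geometric IoU: every non-empty voxel counts as 'occupied', class is ignored."""
    inter = sum(1 for p, g in zip(pred, gt) if p != EMPTY and g != EMPTY)
    union = sum(1 for p, g in zip(pred, gt) if p != EMPTY or g != EMPTY)
    return inter / union if union else 0.0


def semantic_miou(pred: Sequence[int], gt: Sequence[int], num_classes: int) -> float:
    """Mean of per-class IoU over semantic classes (the empty label is skipped)."""
    ious = []
    for c in range(1, num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six voxels flattened into a list, two semantic classes.
gt = [0, 1, 1, 2, 2, 0]
pred = [0, 1, 2, 2, 2, 1]
print(occupancy_iou(pred, gt))     # geometry only
print(semantic_miou(pred, gt, 3))  # geometry and class must both match
```

In the toy example the geometric IoU is higher than the mIoU because a voxel that is correctly predicted as occupied but assigned the wrong class still counts toward IoU, while it hurts the per-class scores. This is consistent with the reported mIoU (25.3%) being lower than the IoU (32.1%).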

URL

https://arxiv.org/abs/2504.15888

PDF

https://arxiv.org/pdf/2504.15888.pdf
