Abstract
We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction gives the model a 3D geometric understanding of the scene; however, the learned geometry is class-agnostic. Hence, we add semantic information to the model in 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach.
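The abstract describes two pretraining tasks combined into one objective: class-agnostic occupancy prediction and feature distillation from a frozen image foundation model. A minimal sketch of such a combined loss is below; the function name, tensor shapes, loss forms, and weighting are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def occfeat_pretrain_loss(occ_logits, occ_target, student_feats, teacher_feats,
                          distill_weight=1.0):
    """Sketch of a combined pretraining objective: binary occupancy
    prediction plus feature distillation toward a frozen, self-supervised
    image foundation model. All names/shapes here are assumptions.

    occ_logits:    (B, X, Y, Z) predicted occupancy logits
    occ_target:    (B, X, Y, Z) binary occupancy supervision
    student_feats: (B, N, C) 3D features produced by the BEV network
    teacher_feats: (B, N, C) target features from the frozen teacher
    """
    # Geometry task: class-agnostic binary occupancy prediction.
    occ_loss = F.binary_cross_entropy_with_logits(occ_logits, occ_target)

    # Semantics task: pull student features toward the teacher's
    # (here via a cosine-similarity loss).
    distill_loss = 1.0 - F.cosine_similarity(
        student_feats, teacher_feats, dim=-1).mean()

    return occ_loss + distill_weight * distill_loss
```

The key design point the abstract argues for is that neither term alone suffices: occupancy supplies geometry without semantics, and the distillation term injects semantics into the same 3D space.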
URL
https://arxiv.org/abs/2404.14027