Abstract
Recent advances in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. Our approach incorporates four key innovations: i) VFM-driven superpixel generation for detailed semantic representation, ii) a VFM-assisted contrastive learning strategy to align multimodal features, iii) superpoint temporal consistency to maintain stable representations across time, and iv) multi-source data pretraining to generalize across various LiDAR configurations. LargeAD delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning, for LiDAR-based segmentation and object detection alike. Extensive experiments on eleven large-scale multimodal datasets demonstrate the adaptability, efficiency, and robustness of our approach in real-world autonomous driving scenarios.
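To make the cross-modal objective concrete, the following is a minimal PyTorch sketch of what superpixel-driven contrastive alignment between 2D VFM features and 3D point features could look like. It is an illustrative assumption, not the paper's implementation: the function and tensor names (pool_by_superpixel, cross_modal_infonce, img_feats, pt_feats, sp_ids_2d, sp_ids_3d) and the mean-pooling/InfoNCE details are hypothetical, chosen only to show how superpixels can pair pixel and point embeddings.

```python
# Illustrative sketch only; NOT the LargeAD implementation. Assumes per-pixel
# VFM features and per-point 3D backbone features have already been computed,
# along with a superpixel id for every pixel and for every projected point.
import torch
import torch.nn.functional as F

def pool_by_superpixel(feats: torch.Tensor, sp_ids: torch.Tensor,
                       num_sp: int) -> torch.Tensor:
    """Mean-pool features (N, C) into per-superpixel embeddings (num_sp, C)."""
    pooled = torch.zeros(num_sp, feats.shape[1], device=feats.device)
    counts = torch.zeros(num_sp, 1, device=feats.device)
    pooled.index_add_(0, sp_ids, feats)
    counts.index_add_(0, sp_ids, torch.ones(len(sp_ids), 1, device=feats.device))
    return pooled / counts.clamp(min=1)  # clamp avoids division by zero

def cross_modal_infonce(img_feats, sp_ids_2d, pt_feats, sp_ids_3d,
                        num_sp, tau=0.07):
    """InfoNCE between superpixel (2D) and superpoint (3D) embeddings.

    img_feats: (P, C) per-pixel VFM features; sp_ids_2d: (P,) superpixel ids.
    pt_feats:  (N, C) per-point features; sp_ids_3d: (N,) superpixel id each
               LiDAR point projects into. Matching ids form positive pairs.
    """
    z_2d = F.normalize(pool_by_superpixel(img_feats, sp_ids_2d, num_sp), dim=1)
    z_3d = F.normalize(pool_by_superpixel(pt_feats, sp_ids_3d, num_sp), dim=1)
    logits = z_3d @ z_2d.t() / tau             # (num_sp, num_sp) similarities
    targets = torch.arange(num_sp, device=logits.device)
    return F.cross_entropy(logits, targets)    # diagonal entries are positives
```

In this sketch, superpixels that receive no projected LiDAR points would produce degenerate zero rows; a real pipeline would mask such superpixels out before computing the loss rather than relying on the clamp alone.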
URL
https://arxiv.org/abs/2501.04005