Paper Reading AI Learner

LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving

2025-01-07 18:59:59
Lingdong Kong, Xiang Xu, Youquan Liu, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu

Abstract

Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: i) VFM-driven superpixel generation for detailed semantic representation, ii) a VFM-assisted contrastive learning strategy to align multimodal features, iii) superpoint temporal consistency to maintain stable representations across time, and iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on eleven large-scale multimodal datasets highlight its superior performance, demonstrating adaptability, efficiency, and robustness in real-world autonomous driving scenarios.
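The cross-modal alignment sketched in the abstract (pooling LiDAR point features into superpoints via VFM superpixels, then pulling matched 2D/3D pairs together) can be illustrated with a minimal NumPy sketch. The function names, mean-pooling scheme, and InfoNCE formulation below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def pool_superpoint_features(point_feats, superpixel_ids, num_superpixels):
    """Average the features of all LiDAR points that project into the same
    VFM superpixel, yielding one 'superpoint' embedding per superpixel.
    (Hypothetical helper; the paper's pooling may differ.)"""
    dim = point_feats.shape[1]
    pooled = np.zeros((num_superpixels, dim))
    counts = np.zeros(num_superpixels)
    for feat, sp in zip(point_feats, superpixel_ids):
        pooled[sp] += feat
        counts[sp] += 1
    counts = np.maximum(counts, 1)  # guard against empty superpixels
    return pooled / counts[:, None]

def info_nce_loss(feats_3d, feats_2d, temperature=0.07):
    """InfoNCE-style contrastive loss: the i-th superpoint embedding should
    match the i-th superpixel embedding (positives on the diagonal)."""
    a = feats_3d / np.linalg.norm(feats_3d, axis=1, keepdims=True)
    b = feats_2d / np.linalg.norm(feats_2d, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (S, S) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pushes each pooled 3D superpoint embedding toward its paired 2D superpixel embedding and away from all other superpixels in the scene, which is the intuition behind the contrastive pretraining objective described above.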


URL

https://arxiv.org/abs/2501.04005

PDF

https://arxiv.org/pdf/2501.04005.pdf

