
AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding Predictive Architecture for Autonomous Driving with LiDAR Data

2025-01-09 04:47:51
Haoran Zhu, Zhenyuan Dong, Kristi Topollai, Anna Choromanska

Abstract

As opposed to human drivers, current autonomous driving systems still require vast amounts of labeled data to train. Recently, world models have been proposed to simultaneously enhance autonomous driving capabilities, by improving the way these systems understand complex real-world environments, and reduce their data demands via self-supervised pre-training. In this paper, we present AD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding Predictive Architecture), a novel self-supervised pre-training framework for autonomous driving with LiDAR data that, as opposed to existing methods, is neither generative nor contrastive. Our method learns spatial world models with a joint embedding predictive architecture. Instead of explicitly generating masked unknown regions, our self-supervised world models predict Bird's Eye View (BEV) embeddings that represent the diverse nature of autonomous driving scenes. Our approach furthermore eliminates the need to manually create positive and negative pairs, as required in contrastive learning. AD-L-JEPA leads to a simpler implementation and enhanced learned representations. We qualitatively and quantitatively demonstrate the high quality of the embeddings learned with AD-L-JEPA. We furthermore evaluate the accuracy and label efficiency of AD-L-JEPA on popular downstream tasks, such as LiDAR 3D object detection and associated transfer learning. Our experimental evaluation demonstrates that AD-L-JEPA is a plausible approach for self-supervised pre-training in autonomous driving applications and that it outperforms the state of the art (SOTA), including the recently proposed Occupancy-MAE [1] and ALSO [2]. The source code of AD-L-JEPA is available at this https URL.
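As a rough illustration of the kind of objective the abstract describes, the hypothetical PyTorch sketch below pairs a context encoder over a masked BEV feature grid with an EMA target encoder and a predictor that regresses the embeddings of the masked cells, so that no raw regions are reconstructed and no positive/negative pairs are needed. All module names, shapes, the masking strategy, and the smooth-L1 loss are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical JEPA-style sketch over BEV feature maps (not the paper's code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class BEVEncoder(nn.Module):
    """Toy 2D conv encoder over a BEV grid: (B, C, H, W) -> (B, D, H, W)."""
    def __init__(self, in_ch=64, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)


class ADLJEPASketch(nn.Module):
    """Context encoder + predictor regress embeddings of masked BEV cells
    produced by an EMA target encoder."""
    def __init__(self, in_ch=64, dim=128, ema=0.996):
        super().__init__()
        self.context_encoder = BEVEncoder(in_ch, dim)
        self.target_encoder = copy.deepcopy(self.context_encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        # Lightweight predictor from context embeddings to target embeddings.
        self.predictor = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 1),
        )
        self.ema = ema
        self.mask_token = nn.Parameter(torch.zeros(1, in_ch, 1, 1))

    @torch.no_grad()
    def update_target(self):
        # EMA update of the target encoder from the context encoder.
        for pt, pc in zip(self.target_encoder.parameters(),
                          self.context_encoder.parameters()):
            pt.mul_(self.ema).add_(pc.detach(), alpha=1 - self.ema)

    def forward(self, bev, mask):
        # bev:  (B, C, H, W) BEV features from a point-cloud backbone
        # mask: (B, 1, H, W) binary mask, 1 = hidden from the context encoder
        masked_bev = bev * (1 - mask) + self.mask_token * mask
        pred = self.predictor(self.context_encoder(masked_bev))
        with torch.no_grad():
            tgt = self.target_encoder(bev)   # embeddings of the full scene
        # Regress embeddings only at masked cells (smooth L1 as an example loss).
        m = mask.expand_as(pred).bool()
        return F.smooth_l1_loss(pred[m], tgt[m])


if __name__ == "__main__":
    model = ADLJEPASketch()
    bev = torch.randn(2, 64, 32, 32)                  # dummy BEV grid
    mask = (torch.rand(2, 1, 32, 32) < 0.5).float()   # random 50% mask
    loss = model(bev, mask)
    loss.backward()
    model.update_target()
    print(f"loss: {loss.item():.4f}")
```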

URL

https://arxiv.org/abs/2501.04969

PDF

https://arxiv.org/pdf/2501.04969.pdf

