DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

2024-05-07 15:14:20
Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, Bin Dai

Abstract

Vision-centric autonomous driving has recently attracted wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based 4D representation learning framework for autonomous driving, dubbed DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.
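
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of the textbook definitions of IoU, minADE, and average L2 error. It is not the paper's evaluation code: the abstract does not state the benchmarks, prediction horizons, or number of forecast modes, and the function names (iou, ade, min_ade) and toy inputs are hypothetical, chosen only for illustration.

```python
import math

def iou(pred, target):
    """Intersection over union of two equally sized binary masks (flat lists of 0/1)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0

def ade(traj, gt):
    """Average displacement error: mean Euclidean (L2) distance per time step.

    With (x, y) waypoints in metres the result is in metres. Applied to a single
    planned trajectory, the same quantity is an average L2 planning error.
    """
    return sum(math.dist(p, g) for p, g in zip(traj, gt)) / len(gt)

def min_ade(candidates, gt):
    """minADE: the ADE of the best of several candidate forecasts."""
    return min(ade(c, gt) for c in candidates)

# Toy inputs (hypothetical, not from the paper).
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
forecasts = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant 1 m lateral offset
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.3)],  # drifts 0.3 m at the last step
]
print(iou([1, 1, 0, 0], [1, 0, 1, 0]))
print(min_ade(forecasts, ground_truth))
```

Under these definitions, higher is better for IoU and lower is better for minADE and L2 error, which is why the abstract reports increases for the former and decreases for the latter.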

URL

https://arxiv.org/abs/2405.04390

PDF

https://arxiv.org/pdf/2405.04390.pdf

