DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

Abstract

Vision-centric autonomous driving has recently attracted wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing DriveWorld, a world model-based 4D representation learning framework for autonomous driving that is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.
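
The motion forecasting result above is reported in minADE, a distance metric in metres. As a rough illustration of how such a number is commonly computed, the following is a minimal sketch of the usual definition: the average displacement error (ADE) of each candidate trajectory against the ground truth, minimised over the candidates. The function names and toy trajectories are hypothetical, and the exact evaluation protocol used by DriveWorld (number of candidates, horizon, sampling rate) is not stated in this abstract.

```python
import math


def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def min_ade(candidates, gt):
    """Smallest ADE over a set of candidate trajectories (minADE)."""
    return min(average_displacement_error(c, gt) for c in candidates)


# Hypothetical toy data (not from the paper): (x, y) waypoints in metres.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant 1 m lateral offset
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],  # drifts away over time
]

result = min_ade(candidates, ground_truth)
print(result)
```

Under this definition, a 0.1m decrease in minADE means the best candidate trajectory lies, on average, 0.1 m closer to the ground truth per waypoint.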

URL

https://arxiv.org/abs/2405.04390

PDF

https://arxiv.org/pdf/2405.04390.pdf
