Abstract
In this paper, we introduce the first large-scale video prediction model for the autonomous driving domain. To remove the restriction of high-cost data collection and strengthen our model's generalization ability, we acquire massive data from the web and pair it with diverse, high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits of recent latent diffusion models, our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We show that it generalizes to various unseen driving datasets in a zero-shot manner, surpassing both general and driving-specific video prediction counterparts. Furthermore, GenAD can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.
URL
https://arxiv.org/abs/2403.09630