Abstract
Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection; however, they still require a large number of queries and can become computationally expensive when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. To encourage the model to better capture underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to train the SSM. The state of the SSM then provides an informative yet efficient summary of the scene. Based on the features learned by the state-space model, we dynamically update the queries via merge, remove, and split operations, which maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
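The abstract names two mechanisms that a short sketch can make concrete: an SSM that folds per-frame sampled features into a compact state, and merge/remove/split updates on the query set. Below is a minimal, hypothetical PyTorch sketch of those two ideas under simplifying assumptions (a linear time-invariant SSM and toy similarity/score heuristics). It is not the authors' implementation; all function names, thresholds, and shapes are illustrative.

```python
# Illustrative sketch only; not the DySS authors' code.
import torch

def ssm_scan(features, A, B, C):
    """Sequentially fold per-frame sampled features into an SSM state.

    features: (T, Q, D) sampled features for T time steps and Q queries.
    A, B, C:  (D, D) state-transition, input, and readout matrices of a
              simplified linear time-invariant SSM (the paper's SSM may
              be more elaborate, e.g., input-dependent).
    Returns per-step readouts and the final state, which acts as a
    compact summary of the scene over time.
    """
    T, Q, D = features.shape
    state = torch.zeros(Q, D)
    outputs = []
    for t in range(T):
        # h_t = A h_{t-1} + B x_t ;  y_t = C h_t
        state = state @ A.T + features[t] @ B.T
        outputs.append(state @ C.T)
    return torch.stack(outputs), state

def update_queries(queries, scores, merge_thresh=0.9, keep_ratio=0.75):
    """Toy merge / remove / split pass over detection queries.

    queries: (Q, D) query embeddings; scores: (Q,) confidence scores.
    - remove: drop the lowest-scoring queries,
    - merge:  collapse near-duplicates (here by keeping one
              representative of each high-similarity pair),
    - split:  clone the most confident queries with small noise so the
              copies can specialize.
    Thresholds and heuristics are placeholders, not the paper's.
    """
    # remove: keep the top keep_ratio fraction by score
    k = max(1, int(keep_ratio * queries.shape[0]))
    keep = scores.topk(k).indices
    queries, scores = queries[keep], scores[keep]

    # merge: a query is redundant if an earlier query is very similar
    sim = torch.nn.functional.cosine_similarity(
        queries.unsqueeze(1), queries.unsqueeze(0), dim=-1)
    dup = (sim.triu(diagonal=1) > merge_thresh).any(dim=0)
    queries, scores = queries[~dup], scores[~dup]

    # split: duplicate top queries with small perturbations
    top = scores.topk(max(1, queries.shape[0] // 8)).indices
    children = queries[top] + 0.01 * torch.randn_like(queries[top])
    return torch.cat([queries, children], dim=0)

# Toy usage with random tensors: 4 frames, 32 queries, feature dim 64.
feats = torch.randn(4, 32, 64)
A, B, C = torch.eye(64) * 0.9, torch.eye(64), torch.eye(64)
outs, summary = ssm_scan(feats, A, B, C)
queries = update_queries(summary, summary.norm(dim=-1))
```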
URL
https://arxiv.org/abs/2506.10242