Abstract
Transformer-based approaches have advanced the recent development of multi-camera 3D detection in both academia and industry. In a vanilla transformer architecture, queries are randomly initialised and optimised over the whole dataset, without considering the differences among input frames. In this work, we propose to leverage the predictions from an image backbone, which is often highly optimised for 2D tasks, as priors for the transformer part of a 3D detection network. The method works by (1) augmenting image feature maps with 2D priors, (2) sampling query locations via ray-casting along 2D box centroids, and (3) initialising query features with object-level image features. Experimental results show that 2D priors not only help the model converge faster, but also substantially improve the baseline approach, by up to 12% in terms of average precision.
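Step (2) can be illustrated with a minimal sketch of casting a ray through the centre of a 2D box and sampling candidate 3D query locations along it. The abstract does not specify the camera model or the sampling scheme, so the pinhole intrinsics (fx, fy, cx, cy), the fixed list of depths, and the function name cast_ray_through_box_centre are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def cast_ray_through_box_centre(
    box: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    depths: Sequence[float],
) -> List[Point3D]:
    """Return camera-frame 3D points along the ray through a 2D box centre.

    Hypothetical helper: assumes an undistorted pinhole camera and a box
    given as (x1, y1, x2, y2) in pixels. Depths are in metres along the
    camera's z-axis.
    """
    x1, y1, x2, y2 = box
    # Centre of the 2D detection in pixel coordinates.
    u = 0.5 * (x1 + x2)
    v = 0.5 * (y1 + y2)
    # Back-project the pixel to a ray direction at unit depth (z = 1).
    dx = (u - cx) / fx
    dy = (v - cy) / fy
    # Scale the direction by each candidate depth to get 3D sample points.
    return [(dx * d, dy * d, d) for d in depths]


if __name__ == "__main__":
    # Assumed example values: one 2D box and three candidate depths.
    example_box = (600.0, 300.0, 700.0, 400.0)
    points = cast_ray_through_box_centre(
        example_box, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
        depths=[10.0, 20.0, 40.0],
    )
    for point in points:
        print(point)
```

Each returned point is a candidate query location in the camera frame. A multi-camera setup would additionally transform these points into a shared vehicle frame using each camera's extrinsics, which this sketch leaves out.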
Abstract (translated)
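Transformer-based approaches have advanced the recent development of multi-camera 3D detection in both academia and industry. In a vanilla Transformer architecture, queries are randomly initialised and optimised over the whole dataset, without considering the differences among input frames. In this work, we propose to use the predictions of an image backbone, which is often highly optimised for 2D tasks, as priors for the Transformer part of a 3D detection network. The method works by (1) augmenting image feature maps with 2D priors, (2) sampling query locations via ray-casting along 2D box centroids, and (3) initialising query features with object-level image features. Experimental results show that 2D priors not only help the model converge faster, but also improve on the baseline by up to 12% in terms of average precision.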
URL
https://arxiv.org/abs/2301.13592