Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline

Abstract

Recently, perception tasks based on the Bird's-Eye View (BEV) representation have drawn increasing attention, and the BEV representation is a promising foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources for on-vehicle inference or deliver only modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of faster BEV perception on on-vehicle chips. Towards this goal, we first find empirically that the BEV representation can be sufficiently powerful without an expensive transformer-based transformation or a depth representation. Fast-BEV consists of five parts. We propose (1) a lightweight, deployment-friendly view transformation that quickly transfers 2D image features to 3D voxel space, (2) a multi-scale image encoder that leverages multi-scale information for better performance, and (3) an efficient BEV encoder designed specifically to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy in both image and BEV space to avoid over-fitting and (5) a multi-frame feature fusion mechanism to leverage temporal information. In experiments on the 2080Ti platform, our R50 model runs at 52.6 FPS with 47.3% NDS on the nuScenes validation set, compared with 41.3 FPS and 47.5% NDS for the BEVDepth-R50 model and 30.2 FPS and 45.7% NDS for the BEVDet4D-R50 model. Our largest model (R101@900x1600) achieves a competitive 53.5% NDS on the nuScenes validation set. We further develop a benchmark with considerable accuracy and efficiency on currently popular on-vehicle chips. The code is released at: this https URL.
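
The depth-free view transformation in part (1) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single pinhole camera, a precomputed voxel-to-pixel lookup table, nearest-pixel sampling, and a camera frame whose z axis points forward. The names `build_voxel_lut` and `fill_voxels` are hypothetical and chosen only for this example.

```python
import numpy as np


def build_voxel_lut(voxel_centers, intrinsics, cam_from_ego, image_hw):
    """Map each voxel center to the flat index of the pixel it projects to.

    voxel_centers: (N, 3) points in the ego frame.
    intrinsics:    (3, 3) pinhole camera matrix (assumed, no distortion).
    cam_from_ego:  (4, 4) rigid transform from ego frame to camera frame.
    image_hw:      (height, width) of the feature map.
    Returns an (N,) int64 array; -1 marks voxels the camera does not see.
    """
    h, w = image_hw
    n = voxel_centers.shape[0]

    # Move voxel centers into the camera frame using homogeneous coordinates.
    homo = np.concatenate([voxel_centers, np.ones((n, 1))], axis=1)
    cam = (cam_from_ego @ homo.T).T[:, :3]

    # Only points in front of the camera can be projected.
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negatives

    # Pinhole projection to pixel coordinates, rounded down to a pixel index.
    pix = (intrinsics @ cam.T).T
    u = np.floor(pix[:, 0] / safe_z).astype(np.int64)
    v = np.floor(pix[:, 1] / safe_z).astype(np.int64)

    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    lut = np.full(n, -1, dtype=np.int64)
    lut[valid] = v[valid] * w + u[valid]
    return lut


def fill_voxels(image_features, lut):
    """Copy image features into voxels through the lookup table.

    image_features: (H, W, C) feature map from the image encoder.
    lut:            (N,) output of build_voxel_lut.
    Returns (N, C) voxel features; unseen voxels stay zero.
    """
    h, w, c = image_features.shape
    flat = image_features.reshape(h * w, c)
    out = np.zeros((lut.shape[0], c), dtype=image_features.dtype)
    visible = lut >= 0
    out[visible] = flat[lut[visible]]
    return out
```

Because the table depends only on camera calibration and the voxel grid, it can be built once offline under these assumptions. Inference then reduces to an indexed copy, with no depth prediction or attention step.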

URL

https://arxiv.org/abs/2301.12511

PDF

https://arxiv.org/pdf/2301.12511.pdf