RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

Abstract

Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques that use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs, but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we ask whether projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively, but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with ViTs pre-trained on large image datasets. (b) We compensate for ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at this https URL.
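
To make the range projection mentioned in the abstract concrete, the following is a minimal sketch of the commonly used spherical projection, which maps each LiDAR point to a pixel of a 2D range image from its azimuth and elevation angles. It is not taken from the RangeViT code: the function name `range_projection`, the image size, and the vertical field-of-view limits are illustrative assumptions, and a real pipeline would typically store more channels per pixel (for example intensity and the point coordinates) than the range alone.

```python
import math


def range_projection(points, height=32, width=64,
                     fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project (x, y, z) LiDAR points onto a height x width range image.

    Each pixel holds the range of the closest point that falls into it,
    or None when no point lands there. All defaults are illustrative
    assumptions, not values from the paper.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down  # total vertical field of view

    image = [[None] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # skip a degenerate point at the sensor origin

        yaw = math.atan2(y, x)    # azimuth in [-pi, pi]
        pitch = math.asin(z / r)  # elevation angle

        # Normalise both angles to [0, 1], then scale to pixel coordinates.
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height

        # Clamp so points on or outside the field of view stay in bounds.
        col = min(width - 1, max(0, int(math.floor(u))))
        row = min(height - 1, max(0, int(math.floor(v))))

        # Several points can hit the same pixel; keep the closest one.
        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r
    return image


# Two points along the same ray: only the nearer one survives.
demo = range_projection([(10.0, 0.0, 1.0), (20.0, 0.0, 2.0)])
```

The resulting image can be fed to any 2D segmentation backbone, and per-pixel predictions are then mapped back to the 3D points that produced each pixel.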

URL

https://arxiv.org/abs/2301.10222

PDF

https://arxiv.org/pdf/2301.10222.pdf