Paper Reading AI Learner

RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

2023-01-24 18:50:48
Angelika Ando, Spyros Gidaris, Andrei Bursuc, Gilles Puy, Alexandre Boulch, Renaud Marlet

Abstract

Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques that use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs, but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results on many image-based benchmarks. In this work, we ask whether projection-based methods for 3D semantic segmentation can benefit from these latest improvements in ViTs. We answer positively, but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with ViTs pre-trained on large image datasets. (b) We compensate for ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem, combining the low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at this https URL.
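The three ingredients above can be illustrated with a minimal PyTorch sketch. This is a hypothetical toy model, not the authors' RangeViT implementation: the class name, channel counts, and layer sizes are assumptions chosen for brevity. It shows (a) a transformer encoder standing in for a pre-trained ViT backbone, (b) a convolutional stem in place of the linear patch embedding, and (c) a convolutional decoder that fuses upsampled ViT features with fine-grained stem features via a skip connection.

```python
import torch
import torch.nn as nn

class RangeViTSketch(nn.Module):
    """Toy sketch of the three ingredients; not the published architecture."""

    def __init__(self, in_ch=5, dim=96, patch=8, num_classes=17):
        super().__init__()
        # (b) Convolutional stem: keeps a full-resolution feature map
        # (used later for the skip connection), then a strided conv
        # turns it into patch tokens, adding convolutional inductive bias.
        self.stem = nn.Sequential(nn.Conv2d(in_ch, dim, 3, padding=1), nn.GELU())
        self.to_tokens = nn.Conv2d(dim, dim, kernel_size=patch, stride=patch)
        # (a) Plain transformer encoder standing in for a ViT backbone;
        # in the paper this is where ImageNet pre-training pays off.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # (c) Convolutional decoder: upsample coarse ViT features back to
        # full resolution and fuse them with the fine stem features.
        self.up = nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False)
        self.head = nn.Conv2d(2 * dim, num_classes, 1)

    def forward(self, x):                       # x: (B, in_ch, H, W) range image
        fine = self.stem(x)                     # low-level, fine-grained features
        tok = self.to_tokens(fine)              # (B, dim, H/p, W/p) patch grid
        b, d, h, w = tok.shape
        tok = self.encoder(tok.flatten(2).transpose(1, 2))   # (B, h*w, dim)
        coarse = tok.transpose(1, 2).reshape(b, d, h, w)     # high-level, coarse
        coarse = self.up(coarse)                # back to (B, dim, H, W)
        return self.head(torch.cat([coarse, fine], dim=1))   # per-pixel logits

model = RangeViTSketch()
logits = model(torch.randn(2, 5, 32, 64))       # tiny 32x64 range image
print(logits.shape)                             # torch.Size([2, 17, 32, 64])
```

The skip connection is realized here as a simple channel concatenation of the upsampled encoder output with the stem output before the 1x1 classification head; the published model refines this further, but the data flow is the same.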


URL

https://arxiv.org/abs/2301.10222

PDF

https://arxiv.org/pdf/2301.10222.pdf

