Paper Reading AI Learner

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction

2024-04-15 06:45:06
Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, Chao Ma

Abstract

Vision-based perception for autonomous driving requires explicit modeling of a 3D space, into which 2D latent representations are mapped and on which subsequent 3D operators are applied. However, operating on dense latent spaces incurs cubic time and space complexity, which limits scalability in perception range and spatial resolution. Existing approaches compress the dense representation using projections such as Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections lose information, which is especially harmful for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. First, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Second, a feature pyramid and sparse interpolation enhance each scale with information from the others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction in FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIoU, which can in part be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
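To make the "spatially decomposed 3D sparse convolutional kernels" idea concrete, below is a minimal, hypothetical sketch (not the paper's actual implementation): sparse voxels are stored as a dict mapping coordinates to scalar features, and a full 3×3×3 kernel (27 taps) is replaced by three sequential 1D passes of 3 taps each, so cost scales with 3·k rather than k³ while features still diffuse into neighbouring empty voxels, i.e. latent completion. The fixed smoothing weights stand in for learned kernel weights.

```python
# Hypothetical simplification of a spatially decomposed sparse 3D
# convolution in the spirit of SparseOcc's 3D sparse diffuser:
# scalar features, kernel size 3, fixed weights in place of learned ones.

def sparse_conv_1d(voxels, axis, weights=(0.25, 0.5, 0.25)):
    """One 1D pass along `axis` over a sparse voxel dict {(x, y, z): feat}.
    Each occupied voxel scatters weighted features to offsets -1, 0, +1,
    creating new (completed) voxels where neighbours were empty."""
    out = {}
    for coord, feat in voxels.items():
        for offset, w in zip((-1, 0, 1), weights):
            nbr = list(coord)
            nbr[axis] += offset
            nbr = tuple(nbr)
            out[nbr] = out.get(nbr, 0.0) + w * feat
    return out

def sparse_diffuser(voxels):
    """Decomposed 3D diffusion: apply the 1D pass along x, then y, then z.
    Equivalent support to one 3x3x3 pass, at 3*3 taps instead of 27."""
    for axis in range(3):
        voxels = sparse_conv_1d(voxels, axis)
    return voxels

# Usage: a single occupied voxel diffuses into its full 3x3x3 neighbourhood.
sparse = {(5, 5, 5): 1.0}
completed = sparse_diffuser(sparse)
print(len(completed))  # 27 voxels now carry non-zero features
```

Because the 1D weights each sum to 1, total feature mass is preserved while the occupied set is dilated, which is the sense in which the diffuser "completes" the sparse latent space.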


URL

https://arxiv.org/abs/2404.09502

PDF

https://arxiv.org/pdf/2404.09502.pdf

