
NeRF-DetS: Enhancing Multi-View 3D Object Detection with Sampling-adaptive Network of Continuous NeRF-based Representation

2024-04-22 06:59:03
Chi Huang, Xinyang Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji

Abstract

As a preliminary work, NeRF-Det unifies novel view synthesis and 3D perception, demonstrating that perception tasks can benefit from novel view synthesis methods such as NeRF and significantly improving indoor multi-view 3D object detection. Two factors contribute to this improvement: the geometry MLP of NeRF directs the attention of the detection head to crucial parts, and novel view rendering supplies a self-supervised loss. To better leverage the advantages of the continuous representation that neural rendering provides in space, we introduce a novel 3D perception network structure, NeRF-DetS. Its key component is the Multi-level Sampling-Adaptive Network, which makes the sampling process adaptive, from coarse to fine. We also propose a multi-view information fusion method, Multi-head Weighted Fusion, which addresses the loss of multi-view information that occurs when the arithmetic mean is used, while keeping computational cost low. On the ScanNetV2 dataset, NeRF-DetS outperforms the competitive NeRF-Det, improving mAP@.25 and mAP@.50 by +5.02% and +5.92%, respectively.
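
The abstract contrasts arithmetic-mean fusion with Multi-head Weighted Fusion but does not spell out the mechanism. The sketch below is only an illustration of the general idea, not the authors' implementation. It assumes that each 3D sample point has one feature vector per camera view, that the channels are split into heads, and that each head receives its own per-view score, normalized with a softmax. The function names, the `view_scores` input (which in a real network would presumably come from a learned layer), and the softmax normalization are all hypothetical.

```python
import math


def mean_fusion(view_feats):
    """Baseline: arithmetic mean over views.

    view_feats: list of V feature vectors, each a list of C floats.
    """
    num_views = len(view_feats)
    channels = len(view_feats[0])
    return [sum(f[c] for f in view_feats) / num_views for c in range(channels)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_weighted_fusion(view_feats, view_scores, num_heads):
    """Hypothetical weighted fusion: one softmax over views per head.

    view_feats:  list of V feature vectors, each of length C.
    view_scores: list of V score lists, each of length num_heads
                 (assumed here to be given; not specified by the abstract).
    """
    channels = len(view_feats[0])
    if channels % num_heads != 0:
        raise ValueError("channels must be divisible by num_heads")
    head_dim = channels // num_heads

    fused = []
    for h in range(num_heads):
        # Per-head weights over the views, summing to 1.
        weights = softmax([scores[h] for scores in view_scores])
        # Apply this head's weights to its own slice of channels.
        for c in range(h * head_dim, (h + 1) * head_dim):
            fused.append(sum(w * f[c] for w, f in zip(weights, view_feats)))
    return fused


if __name__ == "__main__":
    feats = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # 2 views, 4 channels
    scores = [[10.0, 0.0], [0.0, 0.0]]  # head 0 strongly prefers view 0
    print("mean    :", mean_fusion(feats))
    print("weighted:", multi_head_weighted_fusion(feats, scores, num_heads=2))
```

With equal scores the weighted variant reduces to the arithmetic mean. With unequal scores each head can emphasize a different view, which is one plausible way to retain view-specific information that a plain mean would wash out.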

URL

https://arxiv.org/abs/2404.13921

PDF

https://arxiv.org/pdf/2404.13921.pdf

