BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

2025-06-18 16:40:05
Yuqing Lan, Chenyang Zhu, Zhirui Gao, Jiazhao Zhang, Yihan Cao, Renjiao Yi, Yijie Wang, Kai Xu

Abstract

Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational and memory overhead and hinders real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient, real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D bounding-box detection, coupled with CLIP to capture open-vocabulary semantics of the detected objects. To fuse the bounding boxes detected across different views into a unified set, we employ an association module that establishes cross-view correspondences and an optimization module that fuses the 3D bounding boxes of the same instance predicted in multiple views. The association module uses 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an efficient IoU-guided random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on the ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method generalizes well across various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
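
The association and optimization steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes axis-aligned boxes stored as [xmin, ymin, zmin, xmax, ymax, zmax] (the detector's boxes may be oriented), and it replaces the paper's particle-filter scheme with a simple IoU-guided random search. The names iou_3d, nms_3d, and fuse_boxes, along with all parameters and thresholds, are hypothetical.

```python
import numpy as np


def iou_3d(a, b):
    """IoU of two axis-aligned boxes [xmin, ymin, zmin, xmax, ymax, zmax]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    lo = np.maximum(a[:3], b[:3])          # lower corner of the overlap
    hi = np.minimum(a[3:], b[3:])          # upper corner of the overlap
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    union = np.prod(a[3:] - a[:3]) + np.prod(b[3:] - b[:3]) - inter
    return float(inter / union) if union > 0 else 0.0


def nms_3d(boxes, scores, iou_thresh=0.5):
    """Greedy 3D NMS: keep the highest-scoring box, drop overlapping ones."""
    order = np.argsort(scores)[::-1]       # indices sorted by descending score
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(int(i))
    return keep


def fuse_boxes(observations, n_particles=64, n_iters=20, sigma=0.05, seed=0):
    """IoU-guided random search (a simplified stand-in for particle filtering).

    Starts from the mean of the observed boxes, samples perturbed candidates,
    and keeps the candidate with the highest mean IoU against all observations.
    """
    rng = np.random.default_rng(seed)
    obs = np.asarray(observations, dtype=float)

    def score(box):
        return float(np.mean([iou_3d(box, o) for o in obs]))

    best = obs.mean(axis=0)
    best_score = score(best)
    for _ in range(n_iters):
        particles = best + rng.normal(0.0, sigma, size=(n_particles, 6))
        for p in particles:
            if np.all(p[3:] > p[:3]):      # skip degenerate boxes
                s = score(p)
                if s > best_score:
                    best, best_score = p, s
        sigma *= 0.9                       # shrink the search radius over time
    return best, best_score


if __name__ == "__main__":
    boxes = [[0, 0, 0, 1, 1, 1], [0.05, 0, 0, 1.05, 1, 1], [3, 3, 3, 4, 4, 4]]
    scores = [0.9, 0.8, 0.7]
    print("kept indices:", nms_3d(boxes, scores))
    fused, fused_score = fuse_boxes(boxes[:2])
    print("fused box:", np.round(fused, 3), "mean IoU:", round(fused_score, 3))
```

In this sketch, NMS removes near-duplicate detections of one object, and the random search then refines a single box so that it agrees with every view's prediction of that object. Because the search only ever accepts a candidate that improves the mean IoU, the fused box is never worse than the plain average of the observations.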

URL

https://arxiv.org/abs/2506.15610

PDF

https://arxiv.org/pdf/2506.15610.pdf

