FM-Fusion: Instance-aware Semantic Mapping Boosted by Vision-Language Foundation Models

2024-02-07 03:19:02
Chuhao Liu, Ke Wang, Jieqi Shi, Zhijian Qiao, Shaojie Shen

Abstract

Semantic mapping built on supervised object detectors is sensitive to the image distribution. In real-world environments, object detection and segmentation performance can drop sharply, which prevents the use of semantic mapping in wider domains. Vision-language foundation models, by contrast, demonstrate strong zero-shot transferability across data distributions, which provides an opportunity to construct generalizable instance-aware semantic maps. This work therefore explores how to boost instance-aware semantic mapping using object detections generated by foundation models. We propose a probabilistic label fusion method to predict closed-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Taking a sequence of RGB-D frames as input, our system incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method on the ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task, significantly outperforming traditional semantic mapping methods.
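
The abstract does not spell out the label fusion step, so the following is only an illustrative sketch of one generic way to fuse open-set label measurements into a closed-set class posterior: a naive Bayes update in log space, with each measurement weighted by its detection score. It is not the authors' formulation. The class names, the open-set vocabulary, the likelihood table, and the function `fuse_labels` are all hypothetical and chosen purely for illustration.

```python
import numpy as np

# Hypothetical closed-set classes and open-set label vocabulary.
CLASSES = ["chair", "sofa", "table"]
OPEN_LABELS = ["chair", "armchair", "couch", "desk"]

# Assumed likelihood table: LIKELIHOOD[i, j] = p(open label j | true class i).
# Each row sums to 1. These numbers are invented for the example.
LIKELIHOOD = np.array([
    [0.60, 0.30, 0.05, 0.05],  # chair
    [0.10, 0.20, 0.65, 0.05],  # sofa
    [0.05, 0.05, 0.05, 0.85],  # table
])


def fuse_labels(observations, prior=None):
    """Fuse (open_label, score) measurements into a closed-set posterior.

    Each measurement adds score * log p(label | class) to the log posterior,
    so low-confidence detections influence the result less.
    """
    if prior is None:
        prior = np.full(len(CLASSES), 1.0 / len(CLASSES))
    log_post = np.log(np.asarray(prior, dtype=float))
    for label, score in observations:
        j = OPEN_LABELS.index(label)
        log_post += score * np.log(LIKELIHOOD[:, j])
    # Subtract the maximum before exponentiating for numerical stability.
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()


if __name__ == "__main__":
    # Three frames observe the same instance with different open-set labels.
    measurements = [("couch", 0.9), ("armchair", 0.5), ("couch", 0.8)]
    posterior = fuse_labels(measurements)
    for name, p in zip(CLASSES, posterior):
        print(f"{name}: {p:.3f}")
```

Working in log space keeps the update stable over long RGB-D sequences, where multiplying many small likelihoods directly would underflow.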


URL

https://arxiv.org/abs/2402.04555

PDF

https://arxiv.org/pdf/2402.04555.pdf

