FM-Fusion: Instance-aware Semantic Mapping Boosted by Vision-Language Foundation Models

Abstract

Semantic mapping based on supervised object detectors is sensitive to the image distribution. In real-world environments, object detection and segmentation performance can drop sharply, which prevents the use of semantic mapping in a wider domain. On the other hand, vision-language foundation models demonstrate strong zero-shot transferability across data distributions. This provides an opportunity to construct generalizable instance-aware semantic maps. Hence, this work explores how to boost instance-aware semantic mapping using object detections generated by foundation models. We propose a probabilistic label fusion method to predict closed-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our system incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method on the ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task. It significantly outperforms the traditional semantic mapping method.
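
The abstract does not spell out the label fusion model, so the following is only a minimal sketch of what fusing open-set label measurements into a closed-set class posterior could look like, assuming a simple recursive Bayesian update. The class list, the likelihood table `LIKELIHOOD`, the score-tempering rule, and the function name `fuse_labels` are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical closed set of target classes (not from the paper).
CLASSES = ["chair", "table", "sofa"]

# Hypothetical likelihoods p(open-set label | closed-set class),
# one entry per class in CLASSES order. Values are illustrative only.
LIKELIHOOD = {
    "chair":    [0.80, 0.10, 0.10],
    "armchair": [0.50, 0.05, 0.45],
    "couch":    [0.10, 0.05, 0.85],
    "desk":     [0.05, 0.90, 0.05],
}

def fuse_labels(measurements, prior=None):
    """Fuse (open_set_label, detection_score) pairs into a class posterior.

    Assumption: measurements are conditionally independent given the class,
    and a low detection score pulls the likelihood toward uniform.
    """
    n = len(CLASSES)
    prior = prior if prior is not None else [1.0 / n] * n
    log_post = [math.log(p) for p in prior]
    for label, score in measurements:
        lik = LIKELIHOOD.get(label)
        if lik is None:
            continue  # unknown open-set label: carries no information here
        for i in range(n):
            tempered = score * lik[i] + (1.0 - score) / n
            log_post[i] += math.log(tempered)
    # Normalize in a numerically stable way.
    peak = max(log_post)
    unnorm = [math.exp(v - peak) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Three frames observe the same instance with different open-set labels.
posterior = fuse_labels([("armchair", 0.6), ("couch", 0.9), ("couch", 0.8)])
best = CLASSES[posterior.index(max(posterior))]
```

In this sketch, repeated observations accumulate in log space, so a single ambiguous label such as "armchair" is outweighed by later, more confident measurements of the same instance.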

Abstract (translated)

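Semantic mapping based on supervised object detection is sensitive to the image distribution. In real-world environments, a drop in object detection and segmentation performance has a large impact, which prevents semantic mapping from being used in a wider domain. On the other hand, vision-language foundation models show strong zero-shot transferability across data distributions. This offers an opportunity to build generalizable instance-aware semantic maps. This work therefore explores how to improve instance-aware semantic mapping using object detections from foundation models. We propose a probabilistic label fusion method that predicts closed-set semantic classes from open-set label measurements. An instance refinement module merges the over-segmented instances caused by inconsistent segmentation. We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our system incrementally reconstructs an instance-aware semantic map. We evaluate the zero-shot performance of our method on the ScanNet and SceneNN datasets. Our method achieves 40.3 mean average precision (mAP) on the ScanNet semantic instance segmentation task. It far outperforms the traditional semantic mapping method.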

URL

https://arxiv.org/abs/2402.04555

PDF

https://arxiv.org/pdf/2402.04555.pdf
