Contrastive Gaussian Clustering: Weakly Supervised 3D Scene Segmentation

2024-04-19 10:47:53
Myrna C. Silva, Mahtab Dahaghin, Matteo Toso, Alessio Del Bue

Abstract

We introduce Contrastive Gaussian Clustering, a novel approach capable of providing segmentation masks from any viewpoint and of enabling 3D segmentation of the scene. Recent works in novel-view synthesis have shown how to model the appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate images from a given viewpoint by projecting the Gaussians onto it and then $\alpha$-blending their colors. Following this example, we train a model to also include a segmentation feature vector for each Gaussian. These feature vectors can then be used for 3D scene segmentation, by clustering Gaussians according to their feature vectors, and to generate 2D segmentation masks, by projecting the Gaussians onto a plane and $\alpha$-blending their segmentation features. Using a combination of contrastive learning and spatial regularization, our method can be trained on inconsistent 2D segmentation masks and still learn to generate segmentation masks that are consistent across all views. Moreover, the resulting model is extremely accurate, improving the IoU accuracy of the predicted masks by $8\%$ over the state of the art. Code and trained models will be released soon.
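
The two uses of the per-Gaussian feature vectors described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `alpha_blend` and `assign_clusters` are hypothetical, the Gaussians are assumed to be already projected and sorted near to far along one pixel ray, and nearest-prototype assignment by cosine similarity is only a stand-in for whatever clustering the paper actually uses.

```python
import numpy as np


def alpha_blend(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian vectors along one ray.

    features: (N, D) values, one D-dimensional vector per Gaussian,
              assumed sorted from nearest to farthest.
    alphas:   (N,) opacities in [0, 1] of each Gaussian at this pixel.
    Returns the blended (D,) vector for the pixel.
    """
    features = np.asarray(features, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching Gaussian i: product of (1 - alpha) over the
    # Gaussians in front of it (1.0 for the nearest one).
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ features


def assign_clusters(features, prototypes):
    """Label each feature vector with its most similar prototype.

    Hypothetical stand-in for the clustering step: cosine similarity
    against K given prototype vectors. Assumes no zero-length vectors.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return np.argmax(f @ p.T, axis=1)


# Two Gaussians on one ray, each with a 2-D segmentation feature.
pixel_feature = alpha_blend([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])

# Three Gaussians grouped against two prototype features.
labels = assign_clusters(
    np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0]]),
)
```

The same `alpha_blend` weights apply whether the blended quantity is a color or a segmentation feature, which is why a single set of Gaussians can be rendered as either an image or a feature map.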

URL

https://arxiv.org/abs/2404.12784

PDF

https://arxiv.org/pdf/2404.12784.pdf