Abstract
We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot segmentation models. In contrast to prior 3D scene segmentation approaches that rely heavily on video object tracking, Gaga uses spatial information to associate object masks across diverse camera poses. Because it does not assume continuous view changes in the training images, Gaga is robust to variations in camera pose, which is particularly beneficial for sparsely sampled images, and it maintains precise mask label consistency. Furthermore, Gaga accepts 2D segmentation masks from diverse sources and performs robustly with different open-world zero-shot segmentation models. Extensive qualitative and quantitative evaluations show that Gaga performs favorably against state-of-the-art methods, highlighting its potential for real-world applications such as scene understanding and manipulation.
URL
https://arxiv.org/abs/2404.07977