Abstract
Existing works on open-vocabulary semantic segmentation have utilized large-scale vision-language models, such as CLIP, to leverage their exceptional open-vocabulary recognition capabilities. However, transferring these capabilities, learned from image-level supervision, to the pixel-level task of segmentation, while also handling arbitrary unseen categories at inference, remains challenging. To address these issues, we aim to attentively relate objects within an image to given categories by aggregating relational information among class categories and visual semantics, while also adapting the CLIP representations to the pixel-level task. However, we observe that directly optimizing the CLIP embeddings can harm their open-vocabulary capabilities. We therefore propose an alternative: optimizing the image-text similarity map, i.e. the cost map, with a novel cost aggregation-based method. Our framework, named CAT-Seg, achieves state-of-the-art performance across all benchmarks, and we provide extensive ablation studies to validate our design choices. Project page: this https URL.
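To make the "cost map" concrete: it is a dense image-text similarity map, the cosine similarity between every image patch embedding and every class text embedding, which the method then refines via aggregation rather than fine-tuning the embeddings directly. Below is a minimal, hedged sketch of computing such a cost map; the random arrays stand in for CLIP-style patch and prompt embeddings, and the shapes and function name are illustrative assumptions, not the paper's code.

```python
import numpy as np

def cost_map(patch_embeds, text_embeds):
    """Cosine-similarity cost map between patch embeddings (H, W, D)
    and class text embeddings (C, D); returns (H, W, C)."""
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return np.einsum("hwd,cd->hwc", p, t)

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 16, 512))  # stand-in for CLIP ViT patch features
texts = rng.normal(size=(3, 512))         # stand-in for 3 class prompt embeddings
cost = cost_map(patches, texts)
print(cost.shape)  # one similarity score per patch per candidate class
```

Each spatial location of `cost` holds one score per candidate class; a per-pixel `argmax` over the last axis would give a crude segmentation, and the aggregation module described in the abstract refines this map instead of the CLIP embeddings themselves.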
URL
https://arxiv.org/abs/2303.11797