Segment Anything Model Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation

Abstract

Weakly Supervised Semantic Segmentation (WSSS) with only image-level supervision has garnered increasing attention because its annotation cost is far lower than that of pixel-level annotation. Most existing methods rely on Class Activation Maps (CAM) to generate pixel-level pseudo labels for supervised training. However, it is well known that CAM often suffers from partial activation (activating only the most discriminative part rather than the entire object area) and false activation (unnecessarily activating the background around the object). In this study, we introduce a simple yet effective approach that addresses these limitations by combining the recently released Segment Anything Model (SAM) with CAM to generate higher-quality pseudo labels. SAM is a segmentation foundation model that demonstrates strong zero-shot ability in partitioning images into segments but provides no semantic labels for these regions. To circumvent this, we use the pseudo labels for a specific class as the signal to select the most relevant SAM masks and assign them that class, producing the refined pseudo labels for the class. The segments generated by SAM are highly precise, which substantially reduces both partial and false activation. Moreover, existing post-processing modules for producing pseudo labels, such as AffinityNet, are often computationally heavy and require long training times. Surprisingly, we found that combining the initial CAM with SAM achieves performance on par with the post-processed pseudo labels generated by these modules, at much lower computational cost. Our approach is highly versatile and integrates seamlessly into existing WSSS models without modification to base networks or pipelines. Despite its simplicity, our approach improves the mean Intersection over Union (mIoU) of pseudo labels from five state-of-the-art WSSS methods by 6.2% on average on the PASCAL VOC 2012 dataset.
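
The mask-selection idea described above can be illustrated with a minimal sketch. The abstract does not state the selection criterion, so this sketch assumes one: a SAM segment is kept for a class when at least a fixed fraction of its area is covered by that class's CAM pseudo label. The function names `refine_pseudo_label` and `iou`, the parameter `overlap_thresh`, and the toy 6x6 masks are all hypothetical and are not taken from the paper.

```python
import numpy as np


def refine_pseudo_label(cam_mask, sam_masks, overlap_thresh=0.5):
    """Return the union of segments that are mostly covered by the CAM mask.

    Assumption (not from the paper): a segment is selected when the fraction
    of its own area overlapping the CAM pseudo label is >= overlap_thresh.
    """
    refined = np.zeros(cam_mask.shape, dtype=bool)
    for seg in sam_masks:
        area = seg.sum()
        if area == 0:
            continue  # skip empty segments
        overlap = np.logical_and(seg, cam_mask).sum() / area
        if overlap >= overlap_thresh:
            refined |= seg  # label the whole segment with this class
    return refined


def iou(pred, target):
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, target).sum() / union


# Toy ground truth: a 4x4 object inside a 6x6 image.
gt = np.zeros((6, 6), dtype=bool)
gt[1:5, 1:5] = True

# Toy class-agnostic segments: the object and the background around it.
sam_masks = [gt.copy(), ~gt]

# Toy CAM pseudo label with partial activation (only a 3x3 part of the
# object) and false activation (one background pixel).
cam = np.zeros((6, 6), dtype=bool)
cam[1:4, 1:4] = True
cam[0, 0] = True

refined = refine_pseudo_label(cam, sam_masks)
print("IoU of CAM pseudo label:    ", round(float(iou(cam, gt)), 3))
print("IoU of refined pseudo label:", round(float(iou(refined, gt)), 3))
```

In this toy case the object segment is selected because 9 of its 16 pixels are activated, while the background segment is rejected because only 1 of its 20 pixels is, so the refined label recovers the full object and drops the falsely activated pixel. In practice the segments would come from a class-agnostic mask generator such as SAM rather than being constructed by hand, and the threshold would need tuning.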

URL

https://arxiv.org/abs/2305.05803

PDF

https://arxiv.org/pdf/2305.05803.pdf