Abstract
The recent segmentation foundation model, the Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short of generating fine-grained, precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions. First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM on this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design a zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism that improves performance by letting the model focus on regions specified by visual prompts. We evaluate ZIM on the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in downstream tasks that require precise masks, such as image inpainting and 3D NeRF. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code is available at \url{this https URL}.
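To make the prompt-aware masked attention idea concrete, below is a minimal PyTorch sketch of cross-attention in which image tokens outside a prompt-indicated region are suppressed with an additive bias before the softmax. All names, tensor shapes, and the single-layer structure are illustrative assumptions for exposition, not ZIM's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def prompt_aware_masked_attention(queries, keys, values, prompt_mask, num_heads=8):
    """Cross-attention biased toward a prompt-indicated region (sketch).

    queries:     (B, Nq, C)  query tokens derived from the visual prompt
    keys/values: (B, Nk, C)  flattened image features
    prompt_mask: (B, Nk)     1.0 inside the prompted region, 0.0 outside
    """
    B, Nq, C = queries.shape
    d = C // num_heads
    q = queries.view(B, Nq, num_heads, d).transpose(1, 2)   # (B, H, Nq, d)
    k = keys.view(B, -1, num_heads, d).transpose(1, 2)      # (B, H, Nk, d)
    v = values.view(B, -1, num_heads, d).transpose(1, 2)    # (B, H, Nk, d)

    logits = q @ k.transpose(-2, -1) / d ** 0.5             # (B, H, Nq, Nk)
    # Additive mask: a large negative bias on tokens outside the prompted
    # region drives their post-softmax attention weights toward zero.
    bias = (1.0 - prompt_mask)[:, None, None, :] * -1e4
    attn = F.softmax(logits + bias, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, Nq, C)
```

In practice the prompt mask would be rasterized from the point or box prompt (and could be softened rather than binary), so the mechanism steers attention toward the user-specified object without hard-cropping the image features.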
URL
https://arxiv.org/abs/2411.00626