Abstract
The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has shown great promise for zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we present a cost-effective strategy based on text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA), which deeply aligns the text prompts with intermediate local features of the frozen image encoder layers and significantly boosts zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance with 7× fewer parameters than previous SOTA approaches.
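The core matching step described above — assigning multiple text prompts to image embeddings via optimal transport — can be illustrated with entropy-regularized OT solved by Sinkhorn iterations. The sketch below is a minimal, self-contained illustration of that general technique, not the paper's actual implementation; the prompt/pixel dimensions, the cosine-distance cost, and the uniform marginals are all assumptions made for the toy example.

```python
import numpy as np

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (n, m) cost matrix; returns an (n, m) transport plan P
    whose entries sum to 1, with marginals close to uniform.
    """
    K = np.exp(-cost / eps)                 # Gibbs kernel
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals (assumption)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)                      # row scaling
        v = b / (K.T @ u)                    # column scaling
    return u[:, None] * K * v[None, :]       # transport plan P

# Toy example: 4 text-prompt embeddings vs. 6 pixel embeddings,
# with cosine distance as the transport cost (illustrative choice).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
text /= np.linalg.norm(text, axis=1, keepdims=True)
pixels = rng.normal(size=(6, 8))
pixels /= np.linalg.norm(pixels, axis=1, keepdims=True)

cost = 1.0 - text @ pixels.T   # cosine distance in [0, 2]
P = sinkhorn(cost)
print(P.shape)                 # (4, 6)
```

Each row of `P` is a soft assignment of one text prompt over the pixel embeddings, which is what lets different prompts specialize on different semantic attributes instead of all collapsing onto the same regions.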
URL
https://arxiv.org/abs/2301.12171