Abstract
We present Seg-TTO, a novel framework for zero-shot, open-vocabulary semantic segmentation (OVSS), designed to excel in specialized domain tasks. While current open-vocabulary approaches show impressive performance on standard segmentation benchmarks under zero-shot settings, they fall short of their supervised counterparts on highly domain-specific datasets. To address this gap, we focus on segmentation-specific test-time optimization. Segmentation requires an understanding of multiple concepts within a single image while retaining the locality and spatial structure of representations. We propose a novel self-supervised objective adhering to these requirements and use it to align the model parameters with input images at test time. In the textual modality, we learn multiple embeddings for each category to capture diverse concepts within an image, while in the visual modality, we calculate pixel-level losses followed by embedding-aggregation operations that preserve spatial structure. The resulting framework, termed Seg-TTO, is a plug-and-play module. We integrate Seg-TTO with three state-of-the-art OVSS approaches and evaluate it across 22 challenging OVSS tasks covering a range of specialized domains. Seg-TTO demonstrates clear performance improvements across these tasks, establishing a new state of the art. Code: this https URL.
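The abstract does not give the objective in closed form, but the two ingredients it names (multiple text embeddings per category, and pixel-level scoring before aggregation) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names are hypothetical, max-over-embeddings is one plausible aggregation choice, and softmax entropy stands in for the paper's actual self-supervised loss.

```python
import numpy as np

def segment_logits(pixel_feats, class_embeds):
    """Score every pixel against K embeddings per class.

    pixel_feats:  (H, W, D) dense visual features for one image.
    class_embeds: (C, K, D) K learned text embeddings per category
                  (hypothetical shapes; not the paper's exact interface).
    """
    # Cosine similarity: normalize both sides, then contract over D.
    f = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    e = class_embeds / np.linalg.norm(class_embeds, axis=-1, keepdims=True)
    sims = np.einsum("hwd,ckd->hwck", f, e)  # (H, W, C, K)
    # Aggregate over the K per-class embeddings; max keeps the
    # best-matching concept for each pixel (an assumed choice).
    return sims.max(axis=-1)                 # (H, W, C)

def pixel_entropy_loss(logits, tau=0.07):
    """Stand-in self-supervised objective: mean per-pixel softmax entropy.

    Scoring each pixel separately (rather than a pooled image embedding)
    is what keeps the loss sensitive to spatial structure.
    """
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)    # stabilize the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())
```

At test time, one gradient step on such a loss per input image would adapt the model parameters to that image, in the spirit of the test-time optimization the abstract describes.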
URL
https://arxiv.org/abs/2501.04696