Abstract
The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we examine the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is to achieve strong performance not only on novel categories with limited examples, but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally, we introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base class performance. Overall, this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets: COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or transduction). Furthermore, test-time optimization leveraging unlabelled test data can be used to improve the prompts, which we refer to as transductive prompt tuning.
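The unidirectional attention described above can be sketched as an attention mask over the concatenated prompt tokens. The function names and the NumPy formulation below are illustrative assumptions, not the paper's implementation: base prompts attend only among themselves (so the few-shot novel prompts cannot perturb them), while novel prompts also attend to the base prompts and are thereby enriched.

```python
import numpy as np

def unidirectional_prompt_mask(n_base: int, n_novel: int) -> np.ndarray:
    """Attention mask over [base | novel] prompt tokens (hypothetical sketch).

    Entry [i, j] is True when query prompt i may attend to key prompt j.
    Base prompts attend only to base prompts; novel prompts attend to all.
    """
    n = n_base + n_novel
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_base, :n_base] = True   # base -> base only
    mask[n_base:, :] = True         # novel -> base and novel
    return mask

def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                     mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention; disallowed links are masked to -inf."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# With this mask, editing a novel prompt's value leaves base outputs unchanged,
# which is the property that protects base-class performance.
mask = unidirectional_prompt_mask(n_base=2, n_novel=1)
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
out = masked_attention(q, q, v, mask)
```

This one-way flow is what distinguishes the mechanism from ordinary bidirectional self-attention over all prompts, where few-shot novel prompts would also update the base representations.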
URL
https://arxiv.org/abs/2404.11732