
Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach

2024-04-17 20:35:00
Mir Rayat Imtiaz Hossain, Mennatullah Siam, Leonid Sigal, James J. Little

Abstract

The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we examine the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is to achieve strong performance not only on novel categories with limited examples, but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multi-scale transformer decoder to facilitate accurate dense predictions. Additionally, we introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data. This mechanism enriches the novel prompts without degrading base-class performance. Overall, this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets: COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or transduction). Furthermore, test-time optimization leveraging unlabelled test data can be used to improve the prompts, which we refer to as transductive prompt tuning.
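The unidirectional causal attention described above can be pictured as ordinary self-attention over the concatenated base and novel prompt tokens, with an asymmetric mask that lets novel prompts read from base prompts while blocking the reverse direction. The sketch below is a minimal illustration of that masking pattern, not the paper's implementation; the class name, tensor shapes, and use of `torch.nn.MultiheadAttention` are assumptions.

```python
import torch
from torch import nn


class UnidirectionalPromptAttention(nn.Module):
    """Self-attention over [base; novel] prompt tokens in which
    information flows only from base -> novel: base prompts never
    attend to novel prompts, so their outputs depend only on the
    base prompts, while novel prompts are enriched by the base ones.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        # dim must be divisible by num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, base_prompts, novel_prompts):
        # base_prompts:  (B, Nb, d) - learned with abundant base-class data
        # novel_prompts: (B, Nn, d) - learned from a few support examples
        Nb = base_prompts.shape[1]
        Nn = novel_prompts.shape[1]
        prompts = torch.cat([base_prompts, novel_prompts], dim=1)

        # Boolean mask, rows = queries, cols = keys; True = blocked.
        # Block base queries (rows < Nb) from novel keys (cols >= Nb);
        # every other query/key pair is allowed.
        N = Nb + Nn
        mask = torch.zeros(N, N, dtype=torch.bool, device=prompts.device)
        mask[:Nb, Nb:] = True

        out, _ = self.attn(prompts, prompts, prompts, attn_mask=mask)
        return out[:, :Nb], out[:, Nb:]  # updated base / novel prompts
```

Because base queries never see novel keys, the base rows of the output are computed exactly as they would be with no novel prompts present, which is the property that protects base-class performance.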

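For the transductive variant, the abstract states only that unlabelled test data is used to improve the prompts; it does not name the objective. The sketch below uses entropy minimization of the per-pixel class predictions as a common stand-in for such test-time optimization, updating only the novel prompts while the model and base prompts stay frozen; `model(images, prompts)` is a hypothetical interface.

```python
import torch


def transductive_prompt_tuning(model, novel_prompts, test_images,
                               steps=50, lr=1e-3):
    """Test-time refinement of the learned novel prompts on unlabelled
    test images. The entropy-minimization loss here is a stand-in:
    the abstract only says unlabelled test data improves the prompts,
    not which objective is optimized.
    """
    # Freeze the network; only the novel prompts are updated.
    for p in model.parameters():
        p.requires_grad_(False)
    novel_prompts = novel_prompts.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([novel_prompts], lr=lr)

    for _ in range(steps):
        # (B, C, H, W) per-pixel class logits from the prompted decoder.
        logits = model(test_images, novel_prompts)
        probs = logits.softmax(dim=1)
        # Mean per-pixel entropy of the predicted class distribution.
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()

    return novel_prompts.detach()
```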

URL

https://arxiv.org/abs/2404.11732

PDF

https://arxiv.org/pdf/2404.11732.pdf

