Paper Reading AI Learner

ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

2023-01-28 11:51:20
Kwanyoung Kim, Yujin Oh, Jong Chul Ye

Abstract

The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has shown great promise for zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning of the CLIP module. Here, we present a cost-effective strategy based on text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA), which deeply aligns the text prompts with intermediate local features of the frozen image encoder layers and significantly boosts zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance with roughly 7× fewer parameters than previous SOTA approaches.
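The core mechanism the abstract describes is matching a set of text-prompt embeddings to frozen pixel-level image embeddings via optimal transport. The paper's actual implementation is not shown here; the following is a minimal illustrative sketch of entropic-regularized optimal transport solved with Sinkhorn iterations (all names, shapes, and hyperparameters are assumptions for demonstration, not ZegOT's code):

```python
import numpy as np

def sinkhorn(cost, n_iters=100, eps=0.05):
    """Entropic-regularized OT via Sinkhorn iterations (illustrative sketch).

    cost: (M, N) cost matrix between M text prompts and N pixel features.
    Returns the (M, N) transport plan, a joint distribution whose rows
    indicate how strongly each prompt attends to each pixel feature.
    """
    M, N = cost.shape
    K = np.exp(-cost / eps)        # Gibbs kernel from the cost matrix
    a = np.full(M, 1.0 / M)        # uniform marginal over prompts
    b = np.full(N, 1.0 / N)        # uniform marginal over pixels
    u, v = np.ones(M), np.ones(N)
    for _ in range(n_iters):       # alternate marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: 3 hypothetical prompt embeddings vs 4 pixel embeddings.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(3, 8))
pixels = rng.normal(size=(4, 8))

# Cosine-distance cost between L2-normalized embeddings.
pn = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
xn = pixels / np.linalg.norm(pixels, axis=1, keepdims=True)
cost = 1.0 - pn @ xn.T

plan = sinkhorn(cost)
print(plan.shape)   # (3, 4)
print(plan.sum())   # total mass ~ 1.0 (the plan is a joint distribution)
```

Under this formulation, the transport plan distributes each prompt's attention across pixel features while respecting both marginals, which is one way each prompt can specialize on distinct semantic attributes rather than all prompts collapsing onto the same regions.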

Abstract (translated)

The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has brought great promise to zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning of the CLIP module. Here, we propose a cost-effective strategy based on text-prompt learning that keeps the entire CLIP module frozen while fully exploiting its rich information. Specifically, we propose a novel zero-shot segmentation method, Zero-shot segmentation with Optimal Transport (ZegOT), which matches multiple text prompts with frozen image embeddings through optimal transport, enabling each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA), which deeply aligns the text prompts with intermediate local features of the frozen image encoder layers and significantly improves zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance while using roughly 7× fewer parameters than previous SOTA approaches.

URL

https://arxiv.org/abs/2301.12171

PDF

https://arxiv.org/pdf/2301.12171.pdf

