Zero-guidance Segmentation Using Zero Segment Labels

2023-03-23 16:15:07
Pitchaporn Rewatbowornwong, Nattanat Chatthee, Ekapol Chuangsuwanich, Supasorn Suwajanakorn

Abstract

CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose this novel problem, zero-guidance segmentation, and a first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve it without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for evaluating this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: this https URL.
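
Since the abstract outlines a concrete pipeline (over-segment the image, encode each segment into CLIP space, retrieve a text label, merge similar segments), a minimal Python sketch may help make the steps tangible. It uses the public DINO (torch.hub) and CLIP (openai/CLIP) checkpoints, but the details are simplifying assumptions, not the paper's method: KMeans over DINO patch features stands in for the over-segmentation, a fixed candidate vocabulary stands in for open-ended label retrieval, and a bounding-box crop stands in for the paper's attention-masking technique inside CLIP's attention layers. The file name `museum.jpg` and the word list are hypothetical.

```python
import numpy as np
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two frozen generalist models: DINO proposes over-segments, CLIP names them.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()
model, preprocess = clip.load("ViT-B/32", device=device)

dino_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def oversegment(image, k=8):
    """Cluster DINO patch features into k over-segments (a 14x14 patch grid
    for ViT-S/16 at 224x224). A stand-in for the paper's over-segmentation."""
    x = dino_tf(image).unsqueeze(0).to(device)
    with torch.no_grad():
        feats = dino.get_intermediate_layers(x, n=1)[0]  # (1, 1+196, 384)
    patch_feats = feats[0, 1:].cpu().numpy()             # drop the CLS token
    return KMeans(n_clusters=k, n_init=10).fit_predict(patch_feats).reshape(14, 14)

def name_segments(image, seg_map, vocab):
    """Encode each segment into CLIP space and retrieve the nearest label from
    `vocab` (a hypothetical candidate word list). The bounding-box crop below
    uses only local context; the paper instead balances global and local
    context by masking CLIP's internal attention."""
    tokens = clip.tokenize([f"a photo of a {w}" for w in vocab]).to(device)
    with torch.no_grad():
        t = model.encode_text(tokens)
        t = t / t.norm(dim=-1, keepdim=True)
    W, H = image.size
    names = {}
    for s in np.unique(seg_map):
        ys, xs = np.where(seg_map == s)
        box = (int(xs.min()) * W // 14, int(ys.min()) * H // 14,
               (int(xs.max()) + 1) * W // 14, (int(ys.max()) + 1) * H // 14)
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        with torch.no_grad():
            v = model.encode_image(crop)
            v = v / v.norm(dim=-1, keepdim=True)
        names[int(s)] = vocab[(v @ t.T).argmax().item()]
    return names

image = Image.open("museum.jpg").convert("RGB")  # hypothetical input image
seg_map = oversegment(image)
print(name_segments(image, seg_map, ["painting", "person", "wall", "floor"]))
```

The merging step from the abstract is omitted above; a simple variant would merge segments whose CLIP embeddings exceed a cosine-similarity threshold.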

URL

https://arxiv.org/abs/2303.13396

PDF

https://arxiv.org/pdf/2303.13396.pdf

