Abstract
CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline, which leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first partition an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for evaluating this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: this https URL.
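The label-and-merge steps of the pipeline described above can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the embeddings below are stand-ins for CLIP segment and text features, the similarity threshold is arbitrary, and all function names are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def assign_labels(seg_embs, label_embs, labels):
    """Assign each segment the text label whose embedding is closest
    in the shared vision-language space (stand-in for CLIP matching)."""
    sims = cosine_sim(seg_embs, label_embs)
    return [labels[i] for i in sims.argmax(axis=1)]

def merge_segments(seg_embs, assigned, threshold=0.9):
    """Union segments that share a label and have highly similar
    embeddings; returns groups of segment indices."""
    n = len(assigned)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    sims = cosine_sim(seg_embs, seg_embs)
    for i in range(n):
        for j in range(i + 1, n):
            if assigned[i] == assigned[j] and sims[i, j] >= threshold:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Toy example: two near-identical "sky" segments and one "tree" segment.
seg_embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
label_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = ["sky", "tree"]

assigned = assign_labels(seg_embs, label_embs, labels)   # ["sky", "sky", "tree"]
groups = merge_segments(seg_embs, assigned)              # segments 0 and 1 merge
```

In the actual method, the segment embeddings come from CLIP with the proposed attention masking, so the merge step operates on context-balanced features rather than raw crops.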
URL
https://arxiv.org/abs/2303.13396