Vocabulary-free Image Classification and Semantic Segmentation

2024-04-16 19:27:21
Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci


Large vision-language models have revolutionized image classification and semantic segmentation. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios where the semantic context is unknown or evolving. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign to an input image a class from an unconstrained language-induced semantic space, without requiring a known vocabulary. VIC is challenging because the semantic space is vast: it contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts a set of candidate categories from the captions in the database that are most semantically similar to the image, and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other, more complex vision-language models on classification and semantic segmentation benchmarks while using far fewer parameters.
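The two-stage procedure described above (retrieve similar captions, then score candidate categories) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are hand-made toy vectors standing in for the outputs of a pre-trained vision-language model, the candidate extraction is a naive word split with a small stopword list, and every name (`retrieve_captions`, `extract_candidates`, `classify`, `toy_embed_text`) is hypothetical.

```python
import re
import numpy as np

# Toy data: hypothetical stand-ins for vision-language model embeddings.
CAPTIONS = ["a dog on the grass", "a puppy with a ball", "a car in the city"]
CAPTION_EMBS = np.array([[0.8, 0.2, 0.0], [0.7, 0.0, 0.3], [0.0, 0.1, 0.9]])
IMAGE_EMB = np.array([0.9, 0.1, 0.0])
TOY_WORD_EMB = {
    "dog": [1.0, 0.0, 0.0], "grass": [0.0, 1.0, 0.0],
    "puppy": [0.8, 0.0, 0.6], "ball": [0.2, 0.5, 0.5],
    "car": [0.0, 0.0, 1.0], "city": [0.0, 0.3, 0.9],
}
STOPWORDS = {"a", "an", "the", "of", "on", "in", "and", "with"}


def toy_embed_text(word):
    """Assumed text encoder: looks the word up in a toy table."""
    return np.array(TOY_WORD_EMB[word])


def normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_captions(image_emb, caption_embs, captions, k):
    """Return the k captions whose embeddings are closest to the image."""
    sims = normalize(caption_embs) @ normalize(image_emb)
    top = np.argsort(-sims)[:k]
    return [captions[i] for i in top]


def extract_candidates(captions):
    """Naive candidate extraction: unique non-stopword tokens, in order."""
    candidates = []
    for caption in captions:
        for word in re.findall(r"[a-z]+", caption.lower()):
            if word not in STOPWORDS and word not in candidates:
                candidates.append(word)
    return candidates


def classify(image_emb, caption_embs, captions, embed_text, k=2):
    """Retrieve captions, extract candidates, pick the best-scoring one."""
    retrieved = retrieve_captions(image_emb, caption_embs, captions, k)
    candidates = extract_candidates(retrieved)
    cand_embs = normalize(np.stack([embed_text(c) for c in candidates]))
    scores = cand_embs @ normalize(image_emb)
    return candidates[int(np.argmax(scores))], candidates


label, candidates = classify(IMAGE_EMB, CAPTION_EMBS, CAPTIONS, toy_embed_text)
print(label, candidates)
```

In this toy setup, the same encoder space is used both for retrieval and for the final scoring, mirroring the abstract's statement that one vision-language model serves both stages. A real system would replace the toy vectors with model embeddings and the word split with a proper candidate filter.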




