
Vocabulary-free Image Classification and Semantic Segmentation

2024-04-16 19:27:21
Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

Abstract

Large vision-language models have revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts a set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other, more complex vision-language models on classification and semantic segmentation benchmarks while using far fewer parameters.
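
The two-stage procedure described in the abstract (retrieve similar captions, extract candidate categories, then score the candidates against the image) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy bag-of-words encoder `embed_text`, the `VOCAB` and `STOPWORDS` lists, the whitespace-based candidate extraction, and the example `DATABASE` are all hypothetical stand-ins for a real pre-trained vision-language model and a large caption database.

```python
import numpy as np

# Hypothetical toy vocabulary standing in for a real shared embedding space.
VOCAB = ["dog", "cat", "grass", "sofa", "car", "street"]
STOPWORDS = {"a", "an", "the", "on", "in", "of", "is"}


def embed_text(text):
    """Toy text encoder: L2-normalized bag-of-words over VOCAB (not a real model)."""
    words = text.lower().split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve_captions(image_emb, captions, k):
    """Return the k captions whose embeddings are most similar to the image."""
    sims = np.array([embed_text(c) @ image_emb for c in captions])
    top = np.argsort(-sims)[:k]
    return [captions[i] for i in top]


def extract_candidates(captions):
    """Naive candidate extraction: unique non-stopword tokens, in order of appearance."""
    seen = []
    for caption in captions:
        for word in caption.lower().split():
            if word not in STOPWORDS and word not in seen:
                seen.append(word)
    return seen


def classify(image_emb, captions, k=2):
    """Pick the candidate category that best matches the image embedding."""
    retrieved = retrieve_captions(image_emb, captions, k)
    candidates = extract_candidates(retrieved)
    scores = [embed_text(c) @ image_emb for c in candidates]
    return candidates[int(np.argmax(scores))]


# Assumed example data: a tiny caption database and a fake image embedding
# that points mostly toward "dog" with some "grass".
DATABASE = [
    "a dog on the grass",
    "a cat on the sofa",
    "a car in the street",
    "the dog in the street",
]
image_emb = np.array([0.9, 0.0, 0.4, 0.0, 0.0, 0.0])
image_emb = image_emb / np.linalg.norm(image_emb)

print(classify(image_emb, DATABASE))
```

In a realistic setting, the image and text embeddings would come from the same pre-trained vision-language model, and retrieval would run over a much larger caption database. The segmentation variant mentioned in the abstract would, under the same assumptions, apply a procedure of this kind to image regions rather than to the whole image.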

URL

https://arxiv.org/abs/2404.10864

PDF

https://arxiv.org/pdf/2404.10864.pdf

