Abstract
Few-shot learning has been successfully applied to medical image classification, since only very few annotated medical examples are available for training. Given the challenge of limited annotated medical images, image representations should not be derived solely from a single image modality, which is insufficient for characterizing concept classes. In this paper, we propose a new prompting multi-modal model paradigm for medical image classification built on multi-modal foundation models, called PM2. Besides the image modality, PM2 introduces a supplementary text input, known as a prompt, to further describe the corresponding images or concept classes and to facilitate few-shot learning across modalities. To better explore the potential of prompt engineering, we empirically investigate five distinct prompt schemes under this new paradigm. Furthermore, linear probing in multi-modal models acts as a linear classification head that takes only the class token as input, completely ignoring the rich statistics inherent in high-level visual tokens. We therefore perform linear classification on the feature distribution of visual tokens and on the class token simultaneously. To effectively mine these rich statistics, global covariance pooling with efficient matrix power normalization is used to aggregate the visual tokens. We then study and combine two classification heads: one is shared between the class token of the image from the vision encoder and the prompt representation encoded by the text encoder; the other performs classification on the feature distribution of visual tokens from the vision encoder. Extensive experiments on three medical datasets show that PM2 significantly outperforms its counterparts regardless of prompt scheme and achieves state-of-the-art performance.
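The abstract mentions aggregating visual tokens via global covariance pooling with matrix power normalization. A minimal NumPy sketch of that aggregation step is shown below; the function names, the eigendecomposition-based normalization, and the power 0.5 are illustrative assumptions, not details taken from the paper (which may use a more efficient iterative scheme).

```python
import numpy as np

def covariance_pool(tokens):
    """Pool a set of visual tokens (N, D) into a (D, D) covariance matrix."""
    mu = tokens.mean(axis=0, keepdims=True)
    centered = tokens - mu
    return centered.T @ centered / (tokens.shape[0] - 1)

def matrix_power_normalize(cov, power=0.5, eps=1e-6):
    """Raise a symmetric PSD matrix to a fractional power via eigendecomposition."""
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, eps, None) ** power  # clamp for numerical stability
    return (vecs * vals) @ vecs.T

# Example: 14x14 = 196 ViT patch tokens with 64-dim features (synthetic data)
rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))
pooled = matrix_power_normalize(covariance_pool(tokens))
# pooled is symmetric (64, 64); its upper triangle can serve as the feature
# vector fed to the second classification head
feature = pooled[np.triu_indices(64)]
```

The fractional matrix power (commonly a square root) compresses the eigenvalue spectrum of the covariance, which is what "matrix power normalization" refers to in second-order pooling literature.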
URL
https://arxiv.org/abs/2404.08915