PM2: A New Prompting Multi-modal Model Paradigm for Few-shot Medical Image Classification

2024-04-13 07:27:06
Zhenwei Wang, Qiule Sun, Bingbing Zhang, Pengfei Wang, Jianxin Zhang, Qiang Zhang

Abstract

Few-shot learning has been successfully applied to medical image classification, since only very few annotated medical examples are available for training. Given this scarcity of annotated medical images, image representations should not be derived from a single image modality alone, which is insufficient for characterizing concept classes. In this paper, we propose a new prompting multi-modal model paradigm for medical image classification built on multi-modal foundation models, called PM2. Besides the image modality, PM2 introduces a supplementary text input, known as a prompt, to further describe the corresponding images or concept classes and to facilitate few-shot learning across modalities. To better explore the potential of prompt engineering, we empirically investigate five distinct prompt schemes under the new paradigm. Furthermore, linear probing in multi-modal models acts as a linear classification head that takes only the class token as input, completely ignoring the rich statistics inherent in high-level visual tokens. We therefore perform linear classification on the feature distribution of the visual tokens and on the class token simultaneously. To effectively mine these rich statistics, global covariance pooling with efficient matrix power normalization is used to aggregate the visual tokens. We then study and combine two classification heads: one is shared between the image class token from the vision encoder and the prompt representation encoded by the text encoder, while the other classifies the feature distribution of the visual tokens from the vision encoder. Extensive experiments on three medical datasets show that PM2 significantly outperforms its counterparts regardless of the prompt scheme and achieves state-of-the-art performance.
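The two-head design described in the abstract (a linear head shared by the image class token and the encoded prompt, plus a second head over power-normalized second-order statistics of the visual tokens) can be sketched as below. This is a minimal illustration, not the authors' implementation: the module names (`TwoHeadClassifier`, `matrix_power_normalize`), the toy feature dimensions, the eigendecomposition-based matrix power normalization, and the simple logit averaging are all assumptions standing in for the paper's efficient normalization and fusion scheme.

```python
# Minimal sketch of a PM2-style two-head classifier (assumed structure, not official code).
import torch
import torch.nn as nn


def matrix_power_normalize(cov: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Raise a batch of symmetric covariance matrices to the power `alpha`
    (0.5 = matrix square root) via eigendecomposition."""
    eigvals, eigvecs = torch.linalg.eigh(cov)            # cov: (B, D, D), symmetric
    eigvals = eigvals.clamp(min=1e-6).pow(alpha)         # stabilize, then take power
    return eigvecs @ torch.diag_embed(eigvals) @ eigvecs.transpose(-1, -2)


class TwoHeadClassifier(nn.Module):
    """One head shared by the image class token and the text-prompt embedding,
    a second head on the power-normalized covariance of the visual tokens."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.shared_head = nn.Linear(dim, num_classes)      # class token / prompt embedding
        self.cov_head = nn.Linear(dim * dim, num_classes)   # flattened covariance statistics

    def forward(self, class_token, visual_tokens, prompt_embedding):
        # class_token, prompt_embedding: (B, D); visual_tokens: (B, N, D)
        logits_cls = self.shared_head(class_token)
        logits_txt = self.shared_head(prompt_embedding)

        # Global covariance pooling over the N visual tokens.
        centered = visual_tokens - visual_tokens.mean(dim=1, keepdim=True)
        cov = centered.transpose(1, 2) @ centered / (visual_tokens.shape[1] - 1)
        cov = matrix_power_normalize(cov)                   # (B, D, D)
        logits_cov = self.cov_head(cov.flatten(1))

        # Simple averaging as a stand-in fusion; the paper's combination may differ.
        return (logits_cls + logits_txt + logits_cov) / 3.0


if __name__ == "__main__":
    B, N, D, C = 4, 196, 64, 5                              # toy sizes, not the paper's
    model = TwoHeadClassifier(D, C)
    out = model(torch.randn(B, D), torch.randn(B, N, D), torch.randn(B, D))
    print(out.shape)                                        # torch.Size([4, 5])
```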

URL

https://arxiv.org/abs/2404.08915

PDF

https://arxiv.org/pdf/2404.08915.pdf

