
LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP

2024-04-02 20:23:10
Yunshi Huang, Fereshteh Shakeri, Jose Dolz, Malik Boudiaf, Houda Bahig, Ismail Ben Ayed

Abstract

In a recent, strongly emergent literature on few-shot CLIP adaptation, Linear Probe (LP) has often been reported as a weak baseline. This has motivated intensive research into convoluted prompt learning or feature adaptation strategies. In this work, we propose and examine, from convex-optimization perspectives, a generalization of the standard LP baseline, in which the linear classifier weights are learnable functions of the text embedding, with class-wise multipliers blending image and text knowledge. As our objective function depends on two types of variables, i.e., the class visual prototypes and the learnable blending parameters, we propose a computationally efficient block coordinate Majorize-Minimize (MM) descent algorithm. In our full-batch MM optimizer, which we coin LP++, step sizes are implicit, unlike standard gradient descent practices where learning rates are intensively searched over validation sets. By examining the mathematical properties of our loss (e.g., Lipschitz gradient continuity), we build majorizing functions yielding data-driven learning rates and derive approximations of the loss's minima, which provide data-informed initialization of the variables. Our image-language objective function, along with these non-trivial optimization insights and ingredients, yields, surprisingly, highly competitive few-shot CLIP performance. Furthermore, LP++ operates in a black-box setting, relaxes intensive validation searches for the optimization hyper-parameters, and runs orders of magnitude faster than state-of-the-art few-shot CLIP adaptation methods. Our code is available at: \url{this https URL}.
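To make the abstract's description more concrete, below is a minimal sketch of the kind of blended linear probe and block-coordinate updates it describes: classifier weights built from visual prototypes plus class-wise scaled text embeddings, updated with data-driven step sizes instead of a searched learning rate. The blending form w_k = v_k + alpha_k * t_k, the constant initialization of the multipliers, and the Lipschitz-style bounds used as step sizes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def lp_plus_plus_sketch(feats, labels, text_emb, n_iters=300):
    """feats: (N, D) support image features, labels: (N,) long, text_emb: (K, D).
    Assumes L2-normalized CLIP features and at least one shot per class."""
    N, _ = feats.shape
    K = text_emb.shape[0]

    # Data-informed initialization: visual prototypes start from the class means of
    # the support features; class-wise blending multipliers start from a constant.
    protos = torch.stack([feats[labels == k].mean(dim=0) for k in range(K)])
    alpha = torch.full((K, 1), 0.5)

    # Illustrative data-driven step sizes: upper bounds on the Lipschitz constants of
    # the cross-entropy gradient for each block (the softmax Hessian has spectral
    # norm <= 1/2), standing in for the majorizing functions of the MM scheme.
    lip_protos = 0.5 * (feats.norm(dim=1) ** 2).max()
    lip_alpha = 0.5 * ((feats @ text_emb.t()) ** 2).max()

    one_hot = F.one_hot(labels, K).float()
    for _ in range(n_iters):
        # Prototype block: gradient step on the visual prototypes with its own step size.
        weights = protos + alpha * text_emb                       # (K, D) blended classifier
        residual = F.softmax(feats @ weights.t(), dim=1) - one_hot
        protos = protos - (residual.t() @ feats) / (N * lip_protos)

        # Multiplier block: recompute the residual, then step on the blending multipliers.
        weights = protos + alpha * text_emb
        residual = F.softmax(feats @ weights.t(), dim=1) - one_hot
        grad_alpha = (residual * (feats @ text_emb.t())).sum(dim=0, keepdim=True).t() / N
        alpha = alpha - grad_alpha / lip_alpha

    return protos, alpha
```

At test time, a query feature would be classified by its largest score against the blended weights, e.g. `(query @ (protos + alpha * text_emb).t()).argmax(dim=1)`.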


URL

https://arxiv.org/abs/2404.02285

PDF

https://arxiv.org/pdf/2404.02285.pdf

