Abstract
Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embeddings with a few prompts (Test-time Prompt Tuning, TPT) or by replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot bridge the semantic gap between modalities, while the support-set cannot be tuned. To this end, we draw on the complementary strengths of both approaches and propose a novel framework, TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts queried from LLMs, enriching the diversity of the support-set. ii) TSE tunes the support-set via factorized learnable weights, driven by temporal prediction consistency in a self-supervised manner, to mine pivotal supporting cues for each class. $\textbf{TEST-V}$ achieves state-of-the-art results on four benchmarks and offers good interpretability of the support-set dilation and erosion.
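The dilation/erosion pipeline lends itself to a compact sketch. The following is a minimal, illustrative PyTorch sketch based only on the abstract, not the authors' implementation: generated support videos are stubbed with random frame features (in place of an LLM-prompted text-to-video generator and a CLIP-like encoder), and the names `class_prototypes`, `tse_step`, and the exact form of the consistency loss are assumptions made for illustration.

```python
# Sketch of MSD (support-set dilation) + TSE (support-set erosion).
# All tensors and helper names below are illustrative assumptions.
import torch
import torch.nn.functional as F

C, M, T, D = 10, 4, 8, 512  # classes, prompts per class, frames, feature dim

# --- MSD (dilation): one generated support video per (class, prompt) pair.
# Real pipeline: LLM-queried prompts -> video generator -> visual encoder;
# here we stub the encoded frame features with random vectors.
support = F.normalize(torch.randn(C, M, T, D), dim=-1)  # (C, M, T, D)

# --- TSE (erosion): factorized learnable weights over prompts and frames.
w_prompt = torch.zeros(C, M, requires_grad=True)  # per-class prompt weights
w_frame = torch.zeros(T, requires_grad=True)      # shared temporal weights
opt = torch.optim.Adam([w_prompt, w_frame], lr=1e-2)

def class_prototypes():
    # Softmax-normalized weights down-weight (erode) uninformative
    # prompts and frames before pooling into one prototype per class.
    wp = w_prompt.softmax(dim=1)[..., None, None]      # (C, M, 1, 1)
    wf = w_frame.softmax(dim=0)[None, None, :, None]   # (1, 1, T, 1)
    proto = (support * wp * wf).sum(dim=(1, 2))        # (C, D)
    return F.normalize(proto, dim=-1)

def tse_step(test_video):
    # test_video: (T, D) frame features of one test clip.
    # Self-supervised objective (assumed form): per-frame predictions
    # should agree with the clip-level prediction.
    proto = class_prototypes()
    frame_logits = test_video @ proto.t()              # (T, C)
    video_logits = frame_logits.mean(dim=0, keepdim=True)
    loss = F.kl_div(
        frame_logits.log_softmax(-1),
        video_logits.softmax(-1).expand_as(frame_logits),
        reduction="batchmean",
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    return video_logits.argmax(-1).item()

pred = tse_step(F.normalize(torch.randn(T, D), dim=-1))
print("predicted class:", pred)
```

Factorizing the weights into a per-class prompt term and a shared temporal term keeps the number of tunable parameters at C*M + T rather than C*M*T, which is what makes per-sample test-time tuning cheap in this sketch.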
URL
https://arxiv.org/abs/2502.00426