TeST-V: TEst-time Support-set Tuning for Zero-shot Video Classification

2025-02-01 13:20:01
Rui Yan, Jin Wang, Hongyu Qu, Xiaoyu Du, Dong Zhang, Jinhui Tang, Tieniu Tan

Abstract

Recently, adapting Vision-Language Models (VLMs) to zero-shot visual classification by tuning class embeddings with a few prompts (Test-time Prompt Tuning, TPT) or by replacing class names with generated visual samples (a support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities, while the support-set cannot be tuned. To combine the strengths of both, we propose a novel framework, TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes it via learnable weights to dynamically mine key cues (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts queried from LLMs, enriching the diversity of the support-set; ii) TSE tunes the support-set with factorized learnable weights, driven by temporal prediction consistency in a self-supervised manner, to mine the pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and offers good interpretability for support-set dilation and erosion.
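
The abstract describes the mechanism but not the implementation. Below is a minimal PyTorch sketch of one plausible reading, assuming frozen CLIP-style frame features. The function names (`classify_with_support`, `tse_tune`), the simple per-sample erosion weights (the paper factorizes them), and the KL-based temporal-consistency loss are all hypothetical illustrations inferred from the abstract, not the authors' code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of TEST-V's two stages; names and loss terms are
# assumptions inferred from the abstract, not the paper's implementation.

def classify_with_support(video_feats, support, weights):
    """Score each frame against an eroded (re-weighted) support-set.

    video_feats: (T, D)    per-frame CLIP features of one test video
    support:     (C, M, D) M support embeddings per class (MSD output)
    weights:     (C, M)    learnable erosion weights (TSE parameters)
    """
    # Weighted class prototypes: erosion keeps only the pivotal cues.
    proto = (weights.softmax(dim=1).unsqueeze(-1) * support).sum(dim=1)
    proto = F.normalize(proto, dim=-1)
    video_feats = F.normalize(video_feats, dim=-1)
    return video_feats @ proto.t()  # (T, C) per-frame logits

def tse_tune(video_feats, support, steps=10, lr=1e-2):
    """Test-time erosion: tune weights so per-frame predictions agree
    over time (a simple self-supervised temporal-consistency proxy)."""
    C, M, _ = support.shape
    weights = torch.zeros(C, M, requires_grad=True)
    opt = torch.optim.Adam([weights], lr=lr)
    for _ in range(steps):
        logits = classify_with_support(video_feats, support, weights)
        probs = logits.softmax(dim=-1)        # (T, C)
        mean_p = probs.mean(dim=0, keepdim=True)
        # Penalize frames whose predictions drift from the video-level
        # consensus -- one stand-in for "temporal prediction consistency".
        loss = F.kl_div(probs.log(), mean_p.expand_as(probs),
                        reduction='batchmean')
        opt.zero_grad()
        loss.backward()
        opt.step()
    return weights.detach()

# Toy usage with random tensors standing in for CLIP features.
T, C, M, D = 8, 5, 4, 512
video = torch.randn(T, D)       # frame features from a frozen encoder
support = torch.randn(C, M, D)  # MSD: M prompt-generated samples per class
w = tse_tune(video, support)
pred = classify_with_support(video, support, w).mean(dim=0).argmax()
```

Under this reading, MSD corresponds to how `support` is populated (one generated sample per LLM-queried prompt per class), and TSE is the test-time optimization of `weights` alone, with the VLM encoder kept frozen.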

URL

https://arxiv.org/abs/2502.00426

PDF

https://arxiv.org/pdf/2502.00426.pdf
