Paper Reading AI Learner

LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models

2025-02-04 07:40:26
Yuto Kojima, Jiarui Xu, Xueyan Zou, Xiaolong Wang

Abstract

The rapid advancements in vision-language models (VLMs), such as CLIP, have intensified the need to address distribution shifts between training and testing datasets. Although prior Test-Time Training (TTT) techniques for VLMs have demonstrated robust performance, they predominantly rely on tuning text prompts, a process that demands substantial computational resources and is heavily dependent on an entropy-based loss. In this paper, we propose LoRA-TTT, a novel TTT method that leverages Low-Rank Adaptation (LoRA), applied exclusively to the image encoder of VLMs. By introducing LoRA and updating only its parameters during test time, our method offers a simple yet effective TTT approach, retaining the model's initial generalization capability while achieving substantial performance gains with minimal memory and runtime overhead. Additionally, we introduce a highly efficient reconstruction loss tailored for TTT. Our method can adapt to diverse domains by combining these two losses, without increasing memory consumption or runtime. Extensive experiments on two benchmarks, covering 15 datasets, demonstrate that our method improves the zero-shot top-1 accuracy of CLIP-ViT-B/16 by an average of 5.79% on the OOD benchmark and 1.36% on the fine-grained benchmark, efficiently surpassing test-time prompt tuning, without relying on any external models or cache.
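The abstract does not include an implementation, but the core mechanism it describes (freeze CLIP, attach low-rank adapters to the image encoder, and update only those adapter parameters at test time against an entropy objective) can be sketched in a few lines of PyTorch. Everything below is an illustrative reconstruction, not the authors' code: `LoRALinear`, `marginal_entropy`, and `test_time_step` are hypothetical names, the entropy term follows the common TPT-style marginal-entropy formulation over augmented views, and the paper's efficient reconstruction loss is left as a commented placeholder because the abstract does not specify its form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * F.linear(F.linear(x, self.lora_A), self.lora_B)


def marginal_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the class distribution averaged over augmented views (TPT-style objective)."""
    probs = logits.softmax(dim=-1).mean(dim=0)
    return -(probs * probs.clamp_min(1e-12).log()).sum()


def test_time_step(image_encoder: nn.Module,
                   text_features: torch.Tensor,  # (num_classes, dim), L2-normalized, frozen
                   views: torch.Tensor,          # (num_views, 3, H, W) augmented crops of one test image
                   optimizer: torch.optim.Optimizer) -> torch.Tensor:
    """One test-time update: only the LoRA parameters receive gradients."""
    image_features = F.normalize(image_encoder(views), dim=-1)
    logits = 100.0 * image_features @ text_features.t()  # CLIP's learned logit scale is ~100

    loss = marginal_entropy(logits)
    # loss = loss + lam * reconstruction_loss(...)  # hypothetical: the paper's efficient
    #                                               # reconstruction term is not given in the abstract

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach()
```

In this sketch, only the `lora_A`/`lora_B` tensors would be handed to the optimizer (e.g. AdamW over the parameters whose names contain `lora_`), which is what would keep the memory and runtime overhead small and leave the frozen backbone, and hence its zero-shot generalization, intact between test samples.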

URL

https://arxiv.org/abs/2502.02069

PDF

https://arxiv.org/pdf/2502.02069.pdf

