Paper Reading AI Learner

Mixed-precision Supernet Training from Vision Foundation Models using Low Rank Adapter

2024-03-29 09:22:44
Yuiko Sakuma, Masakazu Yoshimura, Junji Otsuka, Atsushi Irie, Takeshi Ohashi

Abstract

Compressing large, performant vision foundation models (VFMs) to arbitrary bit-wise operation (BitOPs) budgets allows their deployment on diverse hardware. We propose fine-tuning a VFM into a mixed-precision quantized supernet. Supernet-based neural architecture search (NAS) can be adopted for this purpose: a supernet is trained once, and subnets satisfying arbitrary hardware budgets can then be extracted from it. However, existing methods struggle to optimize the mixed-precision search space and incur large memory costs during training. To tackle these challenges, we first study effective search space design for fine-tuning a VFM, comparing different operators (such as resolution, feature size, width, depth, and bit-width) in terms of performance and BitOPs reduction. Second, we propose memory-efficient supernet training using a low-rank adapter (LoRA) together with a progressive training strategy. The proposed method is evaluated on the recently proposed VFM, the Segment Anything Model, fine-tuned for segmentation tasks. The searched model achieves about a 95% reduction in BitOPs without performance degradation.
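Since the abstract only summarizes the method, below is a minimal, hypothetical sketch of the core pattern it describes: a frozen, fake-quantized base weight with a trainable LoRA branch, where the bit-width is sampled per training step to train the mixed-precision supernet. All names (`QuantLoRALinear`, `fake_quantize`), the symmetric quantizer, the rank, and the bit-width choices are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not the paper's code.
import random

import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    # Forward uses the quantized weight; gradients flow to the full-precision one.
    return w + (w_q - w).detach()


class QuantLoRALinear(nn.Module):
    """One supernet layer: frozen base weight plus a trainable low-rank adapter,
    fake-quantized at a bit-width sampled per forward pass."""

    def __init__(self, in_features, out_features, rank=8, bit_choices=(2, 4, 8)):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # pretrained VFM weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # zero init: adapter starts as a no-op
        self.bit_choices = bit_choices

    def forward(self, x, bits=None):
        if bits is None:
            bits = random.choice(self.bit_choices)  # sample one subnet's precision
        w = self.base.weight + self.lora_b @ self.lora_a  # merge adapter into the weight
        return x @ fake_quantize(w, bits).t()
```

Because the base weights are frozen, gradients and optimizer state accumulate only for the small rank-r adapter matrices, which is where the memory saving during supernet training would come from; the paper's progressive training strategy is not sketched here. Usage: `layer = QuantLoRALinear(768, 768)`, then `y = layer(torch.randn(4, 768), bits=4)`.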

Abstract (translated)

Compressing large, high-performance vision foundation models (VFMs) to arbitrary bit-wise operation (BitOPs) budgets allows their deployment on various hardware. We propose fine-tuning a VFM into a mixed-precision quantized supernet. Supernet-based neural architecture search (NAS) can be adopted for this purpose: it trains a supernet, from which subnets satisfying arbitrary hardware budgets can then be extracted. However, existing methods face difficulties in optimizing the mixed-precision search space and incur large memory overhead during training. To address these challenges, we first study effective search space design for fine-tuning a VFM by comparing different operators (such as resolution, feature size, width, depth, and bit-width) in terms of performance and BitOPs reduction. We then propose a memory-efficient supernet training method using a low-rank adapter (LoRA) and a progressive training strategy. The proposed method is evaluated on the recently proposed VFM, the Segment Anything Model, fine-tuned on segmentation tasks. The searched model reduces BitOPs by about 95% without performance degradation.

URL

https://arxiv.org/abs/2403.20080

PDF

https://arxiv.org/pdf/2403.20080.pdf

