Abstract
Compressing large, performant vision foundation models (VFMs) to arbitrary bit-wise operation (BitOPs) budgets allows their deployment on diverse hardware. We propose fine-tuning a VFM into a mixed-precision quantized supernet. Supernet-based neural architecture search (NAS) suits this purpose: a single supernet is trained once, and subnets meeting arbitrary hardware budgets can then be extracted from it. However, existing methods struggle to optimize the mixed-precision search space and incur large memory costs during training. To tackle these challenges, first, we study effective search-space design for fine-tuning a VFM by comparing different operators (such as resolution, feature size, width, depth, and bit-width) in terms of performance and BitOPs reduction. Second, we propose memory-efficient supernet training using a low-rank adapter (LoRA) and a progressive training strategy. The proposed method is evaluated on the Segment Anything Model, a recently proposed VFM, fine-tuned for segmentation tasks. The searched model achieves roughly a 95% reduction in BitOPs without performance degradation.
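The abstract combines three ingredients: per-layer mixed-precision quantization, weight sharing across bit-width choices in a supernet, and memory-efficient fine-tuning via LoRA. Below is a minimal PyTorch sketch of how these pieces could fit together in a single layer. It is an illustration under stated assumptions, not the authors' implementation; the names `fake_quantize`, `MixedPrecisionLoRALinear`, and `linear_bitops`, as well as the bit-width choices, are hypothetical.

```python
import random
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform quantize-dequantize: the tensor stays float but is
    restricted to 2**bits levels. The straight-through trick keeps the op
    differentiable (identity gradient w.r.t. w)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()

class MixedPrecisionLoRALinear(nn.Module):
    """One supernet layer: a frozen pretrained weight, fake-quantized at a
    bit-width sampled per forward pass, plus a trainable low-rank (LoRA)
    update. Only the small LoRA factors receive gradients, which is what
    reduces training memory."""
    def __init__(self, in_features, out_features, rank=8, bit_choices=(2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(
            torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight)  # stand-in for VFM weights
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.bit_choices = bit_choices

    def forward(self, x, bits=None):
        if bits is None:  # supernet training: sample a bit-width uniformly
            bits = random.choice(self.bit_choices)
        w = fake_quantize(self.weight, bits) + self.lora_b @ self.lora_a
        return x @ w.t()

def linear_bitops(tokens: int, in_f: int, out_f: int,
                  w_bits: int, a_bits: int) -> int:
    """BitOPs of one linear layer: multiply-accumulate count scaled by the
    weight and activation bit-widths (a common accounting convention)."""
    return tokens * in_f * out_f * w_bits * a_bits

# Usage: random bit-widths during supernet training, fixed bit-widths for
# an extracted subnet chosen to satisfy a BitOPs budget.
layer = MixedPrecisionLoRALinear(768, 768)
y_train = layer(torch.randn(4, 196, 768))           # sampled precision
y_sub = layer(torch.randn(4, 196, 768), bits=4)     # extracted subnet
```

At search time, a subnet is defined by fixing `bits` for every layer; summing a per-layer count such as `linear_bitops` then lets candidate subnets be compared against a target hardware budget.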
URL
https://arxiv.org/abs/2403.20080