Abstract
Despite recent competitive performance across a range of vision tasks, vision Transformers still suffer from heavy computational costs. Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. However, the efficiency of existing models is still far from satisfactory due to the insertion of extensive prompt blocks and tricky prompt designs. In this paper, we propose an efficient vision model named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models with stable memory costs for various complex tasks. In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained main backbone, with the parameters in the backbone frozen. Moreover, we prune the parameters in these two layers according to the lottery ticket hypothesis. The performance obtained by our LION is promising on a wide range of datasets. In particular, our LION reduces up to 11.5% of the training parameters while obtaining higher performance than the state-of-the-art baseline VPT, especially under challenging scenes. Furthermore, we find that our proposed LION has good generalization performance, making it an easy way to boost transfer learning in the future.
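The abstract describes two ideas: equilibrium implicit (DEQ-style) layers, which compute a fixed point z* = f(z*, x) with constant memory cost, and lottery-ticket-style pruning of those layers' parameters. The sketch below illustrates both in a minimal, self-contained way; it is an assumption-laden toy (the `lion_forward` wiring, the identity stand-in for the frozen backbone, and `magnitude_prune` are hypothetical illustrations, not the paper's actual implementation).

```python
import numpy as np

def equilibrium_layer(x, W, U, b, tol=1e-6, max_iter=100):
    """Solve the fixed point z* = tanh(W @ z* + U @ x + b) by iteration.

    This is the generic deep-equilibrium formulation: memory stays constant
    no matter how many iterations are needed, which motivates LION's design.
    """
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

def magnitude_prune(W, sparsity=0.5):
    """One-shot magnitude pruning (a simple lottery-ticket-style mask):
    zero out the smallest-magnitude entries of W."""
    threshold = np.quantile(np.abs(W), sparsity)
    return W * (np.abs(W) >= threshold)

def lion_forward(x, frozen_backbone, head_params, tail_params):
    """Hypothetical LION-style forward pass: two trainable implicit layers
    wrapped around a frozen pre-trained backbone."""
    z = equilibrium_layer(x, *head_params)      # implicit layer at input end
    z = frozen_backbone(z)                      # frozen pre-trained backbone
    return equilibrium_layer(z, *tail_params)   # implicit layer at output end

rng = np.random.default_rng(0)
d = 8
# Small weights keep the fixed-point map contractive so iteration converges.
def make_params():
    W = magnitude_prune(0.1 * rng.standard_normal((d, d)))
    return W, 0.1 * rng.standard_normal((d, d)), 0.1 * rng.standard_normal(d)

x = rng.standard_normal(d)
# Identity function stands in for the (frozen) pre-trained backbone.
out = lion_forward(x, lambda z: z, make_params(), make_params())
print(out.shape)
```

Note the design point the abstract leans on: because only the two implicit layers (further sparsified by pruning) are trainable, the number of tuned parameters is a small fraction of the full backbone.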
URL
https://arxiv.org/abs/2303.09992