Abstract
We present Reversible Vision Transformers, a memory-efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark them extensively across model sizes and across the tasks of image classification, object detection, and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameter count, and accuracy, demonstrating their promise as an efficient backbone for training regimes with limited hardware resources. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 2.3x over that of their non-reversible counterparts. Full code and trained models are available at this https URL. A simpler version, which is easier to understand and modify, is also available at this https URL.
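To make the memory-depth decoupling concrete, below is a minimal PyTorch sketch of a RevNet-style reversible transformer block of the kind the abstract describes: the features are split into two streams, each sub-block update is additive and therefore invertible, so inputs can be recomputed from outputs during backpropagation instead of being stored. The class name, the choice of attention/MLP sub-blocks, and all hyperparameters here are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Sketch of a reversible (two-stream) transformer block.

    Forward:  y1 = x1 + F(x2);  y2 = x2 + G(y1)
    Because each update is additive, the inputs are exactly
    recoverable from the outputs, so intermediate activations
    need not be cached -- memory no longer grows with depth.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def f(self, x):
        # Attention sub-block (pre-norm).
        h = self.norm1(x)
        return self.attn(h, h, h, need_weights=False)[0]

    def g(self, x):
        # MLP sub-block (pre-norm).
        return self.mlp(self.norm2(x))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1, y2):
        # Recompute the block's inputs from its outputs; a custom
        # backward pass would call this instead of reading stored
        # activations, trading recomputation for memory.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

The extra forward computation in inverse() is the "additional computational burden of recomputing activations" mentioned above; the abstract's throughput gains come from the memory savings allowing larger batches, which more than offsets this cost for deeper models.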
URL
https://arxiv.org/abs/2302.04869