
Reversible Vision Transformers

2023-02-09 18:59:54
Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik

Abstract

We present Reversible Vision Transformers, a memory-efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and the tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware-resource-limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 2.3x over their non-reversible counterparts. Full code and trained models are available at this https URL. A simpler version that is easy to understand and modify is also available at this https URL.
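
For intuition, the memory saving comes from making each transformer block exactly invertible, so intermediate activations can be recomputed from a block's outputs during backpropagation instead of being cached. Below is a minimal PyTorch sketch of the RevNet-style two-residual-stream coupling such a block uses (F = attention, G = MLP); the class name and hyperparameters are illustrative, and the custom autograd function that would actually call inverse() in the backward pass is omitted.

import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    # Two-stream coupling:  Y1 = X1 + F(X2),  Y2 = X2 + G(Y1).
    # The inverse recovers the inputs exactly, so activations need
    # not be stored and memory no longer grows with network depth.
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def f(self, x):  # attention sub-block (pre-norm, as in ViT)
        x = self.norm1(x)
        return self.attn(x, x, x, need_weights=False)[0]

    def g(self, x):  # MLP sub-block (pre-norm)
        return self.mlp(self.norm2(x))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1, y2):
        # Reconstruct the inputs from the outputs:
        # X2 = Y2 - G(Y1), then X1 = Y1 - F(X2).
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Sanity check: the inverse recovers the inputs.
blk = ReversibleBlock(dim=64).eval()
x1, x2 = torch.randn(1, 16, 64), torch.randn(1, 16, 64)
y1, y2 = blk(x1, x2)
r1, r2 = blk.inverse(y1, y2)
assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)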


URL

https://arxiv.org/abs/2302.04869

PDF

https://arxiv.org/pdf/2302.04869.pdf

