
ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices

2023-03-17 02:19:28
Chen Tang, Li Lyna Zhang, Huiqiang Jiang, Jiahang Xu, Ting Cao, Quanlu Zhang, Yuqing Yang, Zhi Wang, Mao Yang

Abstract

Neural Architecture Search (NAS) has shown promising performance in the automatic design of vision transformers (ViTs) exceeding 1G FLOPs. However, designing lightweight and low-latency ViT models for diverse mobile devices remains a major challenge. In this work, we propose ElasticViT, a two-stage NAS approach that trains a high-quality ViT supernet over a very large search space covering a wide range of mobile devices, and then searches an optimal sub-network (subnet) for direct deployment. Prior supernet training methods that rely on uniform sampling suffer from the gradient conflict issue: the sampled subnets can have vastly different model sizes (e.g., 50M vs. 2G FLOPs), leading to divergent optimization directions and inferior performance. To address this challenge, we propose two novel sampling techniques: complexity-aware sampling and performance-aware sampling. Complexity-aware sampling limits the FLOPs difference among subnets sampled in adjacent training steps while still covering different-sized subnets in the search space. Performance-aware sampling further selects subnets with good expected accuracy, which reduces gradient conflicts and improves supernet quality. The resulting ElasticViT models achieve top-1 accuracy from 67.2% to 80.0% on ImageNet across 60M to 800M FLOPs without extra retraining, outperforming all prior CNNs and ViTs in both accuracy and latency. Our tiny and small models are also the first ViT models to surpass state-of-the-art CNNs with significantly lower latency on mobile devices. For instance, ElasticViT-S1 runs 2.62x faster than EfficientNet-B0 with 0.1% higher accuracy.
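
The two sampling techniques lend themselves to a compact description in code. Below is a minimal, illustrative Python sketch of one training-step sampler, assuming a subnet can be summarized by its FLOPs and scored with a cheap accuracy proxy. The FLOPs levels, the gap bound, and the helper functions (`sample_subnet_near_flops`, `estimate_accuracy`) are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
import random

# Toy sketch of complexity-aware + performance-aware sampling, as described
# in the abstract. All helpers below are illustrative stand-ins, not the
# released ElasticViT code.

FLOPS_LEVELS = [60e6, 100e6, 200e6, 400e6, 800e6]  # subnet sizes to cover
MAX_FLOPS_GAP = 100e6   # bound on FLOPs change between adjacent steps
NUM_CANDIDATES = 3      # subnets scored per training step

def sample_subnet_near_flops(target):
    """Stand-in: draw a random subnet whose FLOPs are near the target."""
    return {"flops": target * random.uniform(0.9, 1.1)}

def estimate_accuracy(subnet):
    """Stand-in for a cheap accuracy proxy (e.g., a few validation batches)."""
    return random.random()

def sample_step(prev_flops, step):
    # Complexity-aware sampling: cycle through FLOPs levels for coverage,
    # but clamp the target so adjacent steps differ by at most MAX_FLOPS_GAP.
    target = FLOPS_LEVELS[step % len(FLOPS_LEVELS)]
    target = min(max(target, prev_flops - MAX_FLOPS_GAP),
                 prev_flops + MAX_FLOPS_GAP)

    # Performance-aware sampling: among several candidates near the target,
    # keep the one the proxy scores highest.
    candidates = [sample_subnet_near_flops(target) for _ in range(NUM_CANDIDATES)]
    best = max(candidates, key=estimate_accuracy)
    return best, best["flops"]

if __name__ == "__main__":
    flops = FLOPS_LEVELS[0]
    for step in range(10):
        subnet, flops = sample_step(flops, step)
        print(f"step {step}: sampled subnet at {flops / 1e6:.0f}M FLOPs")
```

The design point mirrors the abstract: the clamp keeps adjacent steps' subnet sizes close (limiting gradient conflict), cycling through the levels preserves coverage of the whole search space, and ranking candidates by a proxy approximates performance-aware selection.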

URL

https://arxiv.org/abs/2303.09730

PDF

https://arxiv.org/pdf/2303.09730.pdf

