Paper Reading AI Learner

MobileNetV4 - Universal Models for the Mobile Ecosystem

2024-04-16 12:41:25
Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, Vaibhav Aggarwal, Tenghui Zhu, Daniele Moro, Andrew Howard

Abstract

We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges the Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators that delivers a significant 39% speedup. We also introduce an optimized neural architecture search (NAS) recipe that improves MNv4 search effectiveness. The integration of UIB, Mobile MQA, and the refined NAS recipe yields a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, and specialized accelerators such as the Apple Neural Engine and Google Pixel EdgeTPU - a characteristic not found in any other models tested. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy with a Pixel 8 EdgeTPU runtime of just 3.8ms.
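The unification the abstract describes can be made concrete: a UIB block is a standard inverted bottleneck with two optional depthwise convolutions - one before the expansion 1x1 conv and one between the expansion and projection 1x1 convs - and the four named variants correspond to the four on/off combinations. The sketch below illustrates this with symbolic layer names (the identifiers are illustrative, not the paper's exact notation or code):

```python
def uib_block(dw_before: bool, dw_after: bool, expand_ratio: int = 4):
    """Sketch of the Universal Inverted Bottleneck (UIB) search space.

    Two optional depthwise (DW) convs span the four variants named in the
    abstract: IB, ConvNext, ExtraDW, and FFN. Returns the variant name and
    a symbolic layer sequence.
    """
    layers = []
    if dw_before:
        layers.append("dw3x3")  # optional depthwise before expansion
    layers.append(f"pw1x1_expand_x{expand_ratio}")  # pointwise expansion
    if dw_after:
        layers.append("dw3x3")  # optional depthwise between expand/project
    layers.append("pw1x1_project")  # pointwise projection

    variant = {
        (False, True):  "IB",        # classic Inverted Bottleneck
        (True,  False): "ConvNext",  # ConvNext-like (DW first)
        (True,  True):  "ExtraDW",   # novel Extra Depthwise variant
        (False, False): "FFN",       # pure Feed Forward Network
    }[(dw_before, dw_after)]
    return variant, layers
```

During NAS, the search over these two boolean choices (per block) is what lets a single block type cover all four structures.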


URL

https://arxiv.org/abs/2404.10518

PDF

https://arxiv.org/pdf/2404.10518.pdf

