Abstract
Vision Transformers (ViTs) have marked a paradigm shift in computer vision, outperforming state-of-the-art models across diverse tasks. However, their practical deployment is hampered by high computational and memory demands. This study addresses the challenge by evaluating four primary model compression techniques: quantization, low-rank approximation, knowledge distillation, and pruning. We methodically analyze and compare the efficacy of these techniques and their combinations in optimizing ViTs for resource-constrained environments. Our comprehensive experimental evaluation demonstrates that these methods facilitate a balanced compromise between model accuracy and computational efficiency, paving the way for wider application in edge computing devices.
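To make one of the four techniques concrete, the sketch below shows low-rank approximation applied to a single dense layer's weight matrix via truncated SVD. This is an illustrative example, not the paper's implementation; the hidden size, rank, and random weights are assumptions chosen for demonstration.

```python
import numpy as np

# Illustrative sketch (not the paper's code): compress a dense layer's
# weight matrix with truncated SVD, the standard form of low-rank
# approximation for ViT linear layers.
rng = np.random.default_rng(0)
d, r = 768, 64                       # d: hidden size typical of ViT-Base; r: chosen rank
W = rng.standard_normal((d, d))      # stand-in for a trained d x d weight matrix

# W ≈ A @ B with A: (d, r) and B: (r, d)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]                 # absorb singular values into the left factor
B = Vt[:r, :]

# The layer y = W @ x becomes y = A @ (B @ x): two smaller matmuls.
original_params = W.size             # d * d
compressed_params = A.size + B.size  # 2 * d * r
print(f"params kept: {compressed_params / original_params:.3f}")
```

For d = 768 and r = 64 this keeps 2·64/768 ≈ 17% of the layer's parameters; the accuracy cost depends on how much of the spectrum the discarded singular values carry, which is the trade-off the study evaluates empirically.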
URL
https://arxiv.org/abs/2404.10407