Optimizing the Deployment of Tiny Transformers on Low-Power MCUs

2024-04-03 14:14:08
Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini

Abstract

Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNNs, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, early approaches in this direction are mostly ad-hoc and platform- and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single- and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations in the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention (FWSA), is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting the ARM and RISC-V ISAs, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to the SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention reduces the runtime by 1.53x and the number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a Transformer block for radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14 ms and an energy consumption of 4.92 µJ, 2.32x lower than the SotA PULP-NN library on the same platform.
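
To make the two key ideas concrete, below is a minimal NumPy sketch of single-head attention with fused Q/K weights and depth-first (row-tiled) evaluation of the attention map. The function names, tile size, single-head setup, and floating-point arithmetic are illustrative assumptions; the paper targets quantized integer kernels on MCUs, and its actual kernel API may differ. Biases are omitted for brevity.

import numpy as np

def fuse_qk_weights(W_q, W_k):
    # Offline step: Q K^T = (X W_q)(X W_k)^T = X (W_q W_k^T) X^T,
    # so W_q W_k^T can be precomputed once. This folds two of the four
    # MHSA projection matrices into one, consistent with the ~25%
    # parameter reduction quoted in the abstract.
    return W_q @ W_k.T                                # (d_model, d_model)

def mhsa_depth_first(X, W_f, W_v, d_head, tile=8):
    # Depth-first (row-tiled) attention: only a (tile, seq_len) slice of
    # the attention map is live at any time instead of the full
    # (seq_len, seq_len) matrix, lowering the memory peak.
    seq_len = X.shape[0]
    V = X @ W_v                                       # value projection
    XWf = X @ W_f                                     # fused product, reused by every tile
    out = np.empty((seq_len, W_v.shape[1]))
    for t0 in range(0, seq_len, tile):
        t1 = min(t0 + tile, seq_len)
        scores = XWf[t0:t1] @ X.T / np.sqrt(d_head)   # (tile, seq_len)
        scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        out[t0:t1] = probs @ V
    return out

# Toy usage: one 64-dimensional head over a sequence of 128 tokens.
rng = np.random.default_rng(0)
d_model = d_head = 64
X = rng.standard_normal((128, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
y = mhsa_depth_first(X, fuse_qk_weights(W_q, W_k), W_v, d_head)

With tile=8 and a 128-token sequence, the live attention buffer shrinks from 128x128 to 8x128 entries; this per-buffer saving is the mechanism behind the up-to-6.19x whole-inference memory-peak reduction reported above, though the exact figure depends on the model and tiling chosen by the framework.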

URL

https://arxiv.org/abs/2404.02945

PDF

https://arxiv.org/pdf/2404.02945.pdf