Abstract
Transformer-based deep neural networks have achieved great success in various sequence applications due to their powerful ability to model long-range dependencies. The key module of Transformer is self-attention (SA), which extracts features from the entire sequence regardless of the distance between positions. Although SA helps Transformer perform particularly well on long-range tasks, its computation and memory requirements grow quadratically with the input sequence length. Recently, attention map reuse, which groups multiple SA layers to share one attention map, has been proposed and has achieved significant speedup for speech recognition models. In this paper, we provide a comprehensive study of attention map reuse, focusing on its ability to accelerate inference. We compare the method with other SA compression techniques and conduct a breakdown analysis of its advantages for long sequences. We demonstrate the effectiveness of attention map reuse by measuring latency on both CPU and GPU platforms.
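To make the mechanism concrete, below is a minimal single-head PyTorch sketch of attention map reuse. The class name `ReuseGroup`, the single-head simplification, and the residual wiring are illustrative assumptions, not the paper's implementation; the point is that the quadratic score/softmax computation runs once per group, and each layer in the group reuses the resulting map with only its own value and output projections.

```python
# A minimal sketch of attention map reuse (assumed structure, not the
# authors' code): layers in a "group" share one softmax attention map
# computed from a single query/key projection, so the O(T^2) work is
# paid once per group instead of once per layer.
import math
import torch
import torch.nn as nn


class ReuseGroup(nn.Module):
    """One attention map shared by `group_size` layers (single head for brevity)."""

    def __init__(self, d_model: int, group_size: int):
        super().__init__()
        self.d_model = d_model
        # Query/key projections exist only once per group.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Each layer in the group keeps its own value/output projections.
        self.v_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(group_size))
        self.out_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(group_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k = self.q_proj(x), self.k_proj(x)
        # The O(T^2) score computation and softmax happen once per group.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_model), dim=-1)
        for v_proj, out_proj in zip(self.v_projs, self.out_projs):
            # Each layer reuses the shared map; only the O(T * d^2) parts repeat.
            x = x + out_proj(attn @ v_proj(x))
        return x


# Usage: 4 layers sharing one attention map over a 1000-frame sequence.
group = ReuseGroup(d_model=256, group_size=4)
y = group(torch.randn(2, 1000, 256))
print(y.shape)  # torch.Size([2, 1000, 256])
```

For long sequences (large T), the shared `q @ k.T` and softmax dominate the cost, which is why sharing the map across a group of layers yields the inference speedup the paper studies.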
URL
https://arxiv.org/abs/2301.12444