Long-Short Transformer: Efficient Transformers for Language and Vision

2021-07-05 18:00:14

Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

arXiv_CV

arXiv_CV Classification Attention Relation Language_Model Transformer Pose

Abstract
Abstract (translated)
URL
PDF

Abstract

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic time and memory complexities with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters than previous method, while being faster and is able to handle 3$\times$ as long sequences compared to its full-attention version on the same hardware. On ImageNet, it can obtain the state-of-the-art results~(e.g., Top-1 accuracy 84.1% trained on 224$\times$224 ImageNet-1K only), while being more scalable on high-resolution images. The models and source code will be released soon.

Abstract (translated)

URL

https://arxiv.org/abs/2107.02192

PDF

https://arxiv.org/pdf/2107.02192.pdf