Transformer in Convolutional Neural Networks

2021-06-06 17:01:13

Yun Liu, Guolei Sun, Yu Qiu, Le Zhang, Ajad Chhatkuli, Luc Van Gool

arXiv_CV

arXiv_CV CNN Recognition Attention Relation Transformer Pose

Abstract

We tackle the low efficiency of vision transformers caused by the high computational and space complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, H-MHSA first learns feature relationships within small grids by viewing image patches as tokens. The small grids are then merged into larger ones, within which feature relationships are learned by viewing each small grid from the preceding step as a token. This process is iterated to gradually reduce the number of tokens. The H-MHSA module is readily pluggable into any CNN architecture and amenable to training via backpropagation. We call this new backbone TransCNN; it essentially inherits the advantages of both transformers and CNNs. Experiments demonstrate that TransCNN achieves state-of-the-art accuracy for image recognition. Code and pretrained models are available at this https URL. This technical report will be updated as more experiments are added.
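
To make the efficiency argument concrete, the sketch below counts query-key pairs as a rough proxy for attention cost. It compares full self-attention over all tokens with the grid-then-merge scheme the abstract describes. It is an illustrative counting exercise only, not the authors' implementation: the function names, the fixed square `grid` size, and the assumption that each grid collapses to exactly one token per step are hypothetical choices made for this example.

```python
def global_attention_pairs(height, width):
    """Query-key pairs for full self-attention over height * width tokens."""
    tokens = height * width
    return tokens * tokens


def hierarchical_attention_pairs(height, width, grid):
    """Query-key pairs when attention runs inside grid x grid windows and
    each window is merged into a single token before the next step.

    Assumption (not from the paper): every window becomes exactly one token.
    Returns (total_pairs, tokens_per_step).
    """
    if grid < 2:
        raise ValueError("grid must be >= 2 so each step reduces the token count")
    total_pairs = 0
    tokens_per_step = []
    h, w = height, width
    while h * w > 1:
        tokens_per_step.append(h * w)
        if h <= grid and w <= grid:
            # The remaining tokens fit in one window: a single final pass.
            total_pairs += (h * w) ** 2
            break
        if h % grid or w % grid:
            raise ValueError("height and width must be divisible by grid at every step")
        windows = (h // grid) * (w // grid)
        # Each window attends only among its own grid * grid tokens.
        total_pairs += windows * (grid * grid) ** 2
        # Merge: every window is treated as one token in the next step.
        h, w = h // grid, w // grid
    return total_pairs, tokens_per_step


if __name__ == "__main__":
    full = global_attention_pairs(64, 64)
    hier, steps = hierarchical_attention_pairs(64, 64, 4)
    print(full, hier, steps)
```

Under these assumptions, a 64 x 64 token map with 4 x 4 grids goes through steps of 4096, 256, and 16 tokens and needs 69,888 query-key pairs in total, against 16,777,216 for full self-attention.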

URL

https://arxiv.org/abs/2106.03180

PDF

https://arxiv.org/pdf/2106.03180.pdf