PVTv2: Improved Baselines with Pyramid Vision Transformer

2021-06-25 17:51:09

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

arXiv_CV Segmentation Detection Classification Attention Transformer

Abstract

Transformers have recently shown encouraging progress in computer vision. In this work, we improve the original Pyramid Vision Transformer (PVTv1) with three design changes: (1) locally continuous features with convolutions, (2) position encodings with zero padding, and (3) linear-complexity attention layers with average pooling. With these simple modifications, PVTv2 significantly improves on PVTv1 in classification, detection, and segmentation. Moreover, PVTv2 achieves much better performance than recent works, including Swin Transformer, under ImageNet-1K pre-training. We hope this work will make state-of-the-art vision Transformer research more accessible. Code is available at this https URL.
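To make design (3) concrete, here is a minimal pure-Python sketch of attention whose keys and values are average-pooled to a fixed grid. It is an illustration under stated assumptions, not the authors' implementation: the function names (avg_pool_grid, pooled_attention, score_count) are hypothetical, the pooled grid size is a free parameter (the value 7 used in the usage note is an assumption, not stated in the abstract), and the learned query/key/value projections and multiple heads of a real Transformer layer are omitted.

```python
import math


def avg_pool_grid(tokens, height, width, pool):
    """Average-pool a row-major list of height*width feature vectors
    down to pool*pool vectors (non-overlapping blocks)."""
    if height % pool or width % pool:
        raise ValueError("height and width must be divisible by pool")
    bh, bw = height // pool, width // pool
    dim = len(tokens[0])
    pooled = []
    for pi in range(pool):
        for pj in range(pool):
            acc = [0.0] * dim
            for i in range(pi * bh, (pi + 1) * bh):
                for j in range(pj * bw, (pj + 1) * bw):
                    vec = tokens[i * width + j]
                    for d in range(dim):
                        acc[d] += vec[d]
            pooled.append([a / (bh * bw) for a in acc])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pooled_attention(tokens, height, width, pool):
    """Single-head attention without learned projections (a simplification):
    every token is a query; keys and values are the pooled tokens."""
    keys = avg_pool_grid(tokens, height, width, pool)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, keys))
                    for d in range(dim)])
    return out


def score_count(height, width, pool=None):
    """Number of query-key scores: quadratic in the token count for full
    attention, linear when keys are pooled to a fixed pool*pool grid."""
    n = height * width
    return n * n if pool is None else n * pool * pool
```

As a rough usage check, score_count(56, 56) and score_count(112, 56) differ by a factor of four, while score_count(56, 56, 7) and score_count(112, 56, 7) differ by a factor of two: with a fixed pooled grid, the number of attention scores grows linearly with the number of input tokens.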

URL

https://arxiv.org/abs/2106.13797

PDF

https://arxiv.org/pdf/2106.13797.pdf