Augmenting Convolutional networks with attention-based aggregation

2021-12-27 14:05:41

Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Piotr Bojanowski, Armand Joulin, Gabriel Synnaeve, Hervé Jégou

arXiv_CV

arXiv_CV Segmentation CNN Detection Classification Attention Transformer


Abstract

We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer, akin to a single transformer block, that weights how much each patch contributes to the classification decision. We combine this learned aggregation layer with a simple patch-based convolutional network parametrized by two parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, particularly in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation, and detection.
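To illustrate the idea of replacing average pooling with an attention-based aggregation, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a single head and one query vector, and it omits the learned key/value projections, residual connection, and feed-forward layer that a full transformer block would include. The names `average_pool`, `attention_pool`, `patches`, and `query` are hypothetical and chosen only for this example.

```python
import math


def average_pool(patches):
    """Baseline: uniform mean over all patch vectors."""
    n = len(patches)
    dim = len(patches[0])
    return [sum(p[j] for p in patches) / n for j in range(dim)]


def attention_pool(patches, query):
    """Simplified single-head attention pooling (illustrative assumption).

    Each patch gets a softmax weight from its scaled dot product with the
    query; the output is the weighted sum of the patch vectors.
    Returns (pooled_vector, weights).
    """
    dim = len(query)
    # Scaled dot-product score between the query and every patch.
    scores = [
        sum(q * x for q, x in zip(query, p)) / math.sqrt(dim) for p in patches
    ]
    # Numerically stable softmax over the patches.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of patch vectors replaces the uniform average.
    pooled = [
        sum(w * p[j] for w, p in zip(weights, patches)) for j in range(dim)
    ]
    return pooled, weights


# Toy input: three 2-dimensional patch embeddings and one query vector.
patches = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0]]
query = [1.0, 0.0]
pooled, weights = attention_pool(patches, query)
```

The weights sum to one and can be read as a per-patch map of how much each patch contributes to the pooled representation. With an all-zero query, every patch receives the same weight and the sketch reduces to plain average pooling.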

URL

https://arxiv.org/abs/2112.13692

PDF

https://arxiv.org/pdf/2112.13692.pdf