Abstract
Transformer-based methods have demonstrated excellent performance on super-resolution visual tasks, surpassing conventional convolutional neural networks. However, existing work typically restricts self-attention computation to non-overlapping windows to save computational cost, which limits Transformer-based networks to input information from a small spatial range. To better exploit latent feature information, this paper proposes a novel Hybrid Multi-Axis Aggregation network (HMA). HMA is constructed by stacking Residual Hybrid Transformer Blocks (RHTB) and Grid Attention Blocks (GAB). On the one hand, RHTB combines channel attention with self-attention to enhance non-local feature fusion and produce more visually appealing results. On the other hand, GAB enables cross-region information interaction, jointly modeling similar features to obtain a larger receptive field. For the training phase of the super-resolution task, a novel pre-training method is designed to further strengthen the model's representation capabilities, and the proposed model's effectiveness is validated through extensive experiments. The experimental results show that HMA outperforms state-of-the-art methods on the benchmark datasets. We provide code and models at this https URL.
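The abstract contrasts window-limited self-attention with grid attention that reaches across the whole feature map. A minimal NumPy sketch of the two token groupings (an illustrative assumption, not the paper's implementation — HMA's actual blocks are defined in the full text and released code): window partitioning gathers spatially contiguous patches, while grid partitioning gathers strided samples so each attention group spans the entire input.

```python
import numpy as np

def window_partition(x, ws):
    # Local grouping: split an (H, W, C) map into non-overlapping
    # ws x ws windows -> (num_windows, ws*ws, C).
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def grid_partition(x, gs):
    # Global grouping: gather a gs x gs grid of tokens sampled at
    # stride H//gs, so each group covers the whole map
    # -> (num_groups, gs*gs, C).
    H, W, C = x.shape
    x = x.reshape(gs, H // gs, gs, W // gs, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, gs * gs, C)

# Toy 8x8 single-channel feature map with values 0..63.
x = np.arange(8 * 8).reshape(8, 8, 1)
w = window_partition(x, 4)  # 4 contiguous 4x4 windows
g = grid_partition(x, 4)    # 4 groups of stride-2 samples
print(w.shape, g.shape)     # both (4, 16, 1)
```

Self-attention would then be computed independently within each group; the window groups model local detail while the grid groups let distant, similar textures interact, which is the larger-receptive-field behavior the abstract attributes to GAB.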
URL
https://arxiv.org/abs/2405.05001