Abstract
In recent years, Vision Transformers (ViTs) have shown promising classification performance over Convolutional Neural Networks (CNNs) due to their self-attention mechanism. Many researchers have incorporated ViTs for Hyperspectral Image (HSI) classification. HSIs are characterized by narrow contiguous spectral bands, providing rich spectral data. Although ViTs excel with sequential data, they cannot extract spectral-spatial information like CNNs. Furthermore, for high classification performance, there should be strong interaction between the HSI tokens and the class (CLS) token. To solve these issues, we propose a 3D-Convolution guided Spectral-Spatial Transformer (3D-ConvSST) for HSI classification that utilizes a 3D-Convolution Guided Residual Module (CGRM) in-between encoders to "fuse" the local spatial and spectral information and to enhance feature propagation. Furthermore, we forego the class token and instead apply Global Average Pooling, which effectively encodes more discriminative and pertinent high-level features for classification. Extensive experiments have been conducted on three public HSI datasets to show the superiority of the proposed model over state-of-the-art traditional, convolutional, and Transformer models. The code is available at this https URL.
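The class-token-free design mentioned above can be sketched in a few lines: instead of reading the prediction off a dedicated CLS token, the encoder's output tokens are averaged and fed to a linear classifier. The following NumPy sketch illustrates this idea only; the shapes, function name, and classifier parameters are our own assumptions, not the authors' released code.

```python
import numpy as np

def gap_head(tokens: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Global-Average-Pool the token sequence, then apply a linear classifier.

    tokens: (batch, num_tokens, dim) encoder output with no CLS token.
    W: (dim, num_classes), b: (num_classes,) -- hypothetical classifier params.
    """
    pooled = tokens.mean(axis=1)   # (batch, dim): average over all tokens
    return pooled @ W + b          # (batch, num_classes): class logits

# Toy usage: 2 samples, 9 spatial tokens, 8-dim embeddings, 3 classes.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 9, 8))
W = rng.standard_normal((8, 3))
b = np.zeros(3)
logits = gap_head(tokens, W, b)
print(logits.shape)  # (2, 3)
```

Because every token contributes equally to the pooled representation, no single token has to aggregate all class-relevant information through attention, which is the interaction bottleneck the abstract attributes to the CLS-token design.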
URL
https://arxiv.org/abs/2404.13252