Abstract
Despite the rapid evolution of semantic segmentation for land cover classification in high-resolution remote sensing imagery, integrating multiple data modalities such as Digital Surface Model (DSM), RGB, and Near-infrared (NIR) remains a challenge. Current methods often process only two modalities, missing the rich information that additional ones can provide. To address this gap, we propose a novel \textbf{L}ightweight \textbf{M}ultimodal data \textbf{F}usion \textbf{Net}work (LMFNet) for the fusion and semantic segmentation of multimodal remote sensing images. LMFNet accommodates multiple data types simultaneously, including RGB, NirRG, and DSM, through a weight-sharing, multi-branch vision transformer that minimizes the parameter count while ensuring robust feature extraction. Our proposed multimodal fusion module integrates a \textit{Multimodal Feature Fusion Reconstruction Layer} and a \textit{Multimodal Feature Self-Attention Fusion Layer}, which reconstruct and fuse multimodal features. Extensive testing on the public US3D, ISPRS Potsdam, and ISPRS Vaihingen datasets demonstrates the effectiveness of LMFNet. Specifically, it achieves a mean Intersection over Union ($mIoU$) of 85.09\% on the US3D dataset, a significant improvement over existing methods. Compared to unimodal approaches, LMFNet improves $mIoU$ by 10\% with only a 0.5M increase in parameter count; compared to bimodal methods, our trimodal approach improves $mIoU$ by 0.46 percentage points.
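The abstract does not specify the fusion module's internals, but the overall pattern it describes — one weight-sharing extractor applied to each modality branch, followed by self-attention across the per-modality features — can be illustrated with a toy NumPy sketch. All function names, dimensions, and the single-projection "extractor" below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def shared_extract(x, W):
    # Weight-sharing branch: the SAME projection W is applied to every
    # modality, so extra modalities add no extra extractor parameters.
    return np.tanh(x @ W)

def self_attention_fuse(feats):
    # feats: (M, d) — one feature vector per modality (e.g. RGB, NirRG, DSM).
    d = feats.shape[1]
    scores = feats @ feats.T / np.sqrt(d)          # cross-modality affinities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # row-wise softmax
    fused = attn @ feats                           # attention-weighted features
    return fused.mean(axis=0)                      # single fused vector (d,)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
modalities = [rng.standard_normal(8) for _ in range(3)]  # stand-ins for 3 inputs
feats = np.stack([shared_extract(m, W) for m in modalities])
fused = self_attention_fuse(feats)
print(fused.shape)  # (16,)
```

Because the extractor weights are shared, adding a third modality grows the parameter count only through the lightweight fusion stage, which is consistent with the small (0.5M) overhead the abstract reports.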
URL
https://arxiv.org/abs/2404.13659