Paper Reading AI Learner

GrFormer: A Novel Transformer on Grassmann Manifold for Infrared and Visible Image Fusion

2025-06-17 10:32:05
Huan Kang, Hui Li, Xiao-Jun Wu, Tianyang Xu, Rui Wang, Chunyang Cheng, Josef Kittler

Abstract

In the field of image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. In practice, however, the source images often lie in a non-Euclidean space, whose intrinsic topological structure Euclidean methods usually cannot capture. In particular, the inner product computed in Euclidean space measures algebraic rather than semantic similarity, which leads to undesirable attention outputs and degraded fusion performance. Moreover, the infrared and visible image fusion task requires a balance between low-level details and high-level semantics. To address this issue, in this paper we propose a novel attention mechanism based on the Grassmann manifold for infrared and visible image fusion (GrFormer). Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to effectively integrate the significant information, we develop a cross-modal fusion strategy (CMS) based on a covariance mask, which maximises the complementary properties between different modalities and suppresses highly correlated features, which are deemed redundant. The experimental results demonstrate that our network outperforms SOTA methods both qualitatively and quantitatively on multiple image fusion benchmarks. The codes are available at this https URL.
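The two core ideas in the abstract can be illustrated in a few lines: a subspace spanned by feature vectors is a point on the Grassmann manifold, similarity between two such points can be measured with the projection metric rather than a Euclidean inner product, and a covariance mask can keep only weakly correlated (complementary) channels across modalities. The sketch below is a minimal illustration of these concepts, not the paper's actual network; the function names, the rank-k SVD basis, the Frobenius-norm similarity, and the correlation threshold `tau` are all assumptions for demonstration.

```python
import numpy as np

def subspace_basis(X, k):
    """Orthonormal basis of the rank-k subspace spanned by the rows of X
    (n x d). The d x k basis is a point on the Grassmann manifold Gr(k, d)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T  # columns are orthonormal

def grassmann_similarity(U1, U2):
    """Projection-metric similarity ||U1^T U2||_F^2 between two subspaces:
    0 for orthogonal subspaces, k for identical rank-k subspaces.
    This compares subspaces, unlike a plain Euclidean inner product."""
    return np.linalg.norm(U1.T @ U2, "fro") ** 2

def covariance_mask(f_ir, f_vis, tau=0.8):
    """Toy cross-modal mask: for (channels, pixels) feature maps, keep the
    channels whose infrared/visible correlation is below tau (complementary)
    and suppress highly correlated ones as redundant."""
    ir = f_ir - f_ir.mean(axis=1, keepdims=True)
    vi = f_vis - f_vis.mean(axis=1, keepdims=True)
    corr = np.abs((ir * vi).sum(axis=1)
                  / (np.linalg.norm(ir, axis=1) * np.linalg.norm(vi, axis=1) + 1e-8))
    return (corr < tau).astype(float)

# Tiny demo with random "features" standing in for real feature maps.
rng = np.random.default_rng(0)
U = subspace_basis(rng.standard_normal((64, 16)), k=4)
print(grassmann_similarity(U, U))            # close to k = 4
f_ir = rng.standard_normal((4, 256))
print(covariance_mask(f_ir, f_ir.copy()))    # identical maps -> all suppressed
```

The projection metric is a standard way to compare subspaces on the Grassmann manifold; how GrFormer actually parameterises the projection constraints and builds the mask is detailed in the paper itself.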

URL

https://arxiv.org/abs/2506.14384

PDF

https://arxiv.org/pdf/2506.14384.pdf
