Paper Reading AI Learner

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

2024-05-06 05:58:49
Yunfeng Li, Bo Wang, Ye Li, Zhiwen Yu, Liang Wang

Abstract

Complementary RGB and TIR modalities enable RGB-T tracking to achieve competitive performance in challenging scenarios. How to better fuse cross-modal features is therefore the core issue in RGB-T tracking. Previous methods either fuse RGB and TIR features insufficiently, or depend on intermediaries that carry information from both modalities to achieve cross-modal interaction. The former does not fully exploit the potential of channel and spatial feature fusion performed directly on the RGB and TIR features of the template or search region, while the latter lacks direct interaction between the template and the search region, which limits the model's ability to fully exploit the original semantic information of both modalities. To alleviate these limitations, we explore how direct fusion of cross-modal channel and spatial features can improve the performance of a vision Transformer, and propose CSTNet. CSTNet uses ViT as its backbone and inserts cross-modal channel feature fusion modules (CFM) and cross-modal spatial feature fusion modules (SFM) for direct interaction between RGB and TIR features. The CFM performs joint channel enhancement and joint multilevel spatial feature modeling of the RGB and TIR features in parallel, sums the two results, and then globally integrates the summed feature with the original features. The SFM uses cross-attention to model the spatial relationships of cross-modal features, and then introduces a convolutional feedforward network for joint spatial and channel integration of the multimodal features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance on three public RGB-T tracking benchmarks. Code is available at this https URL.
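
To make the two fusion blocks concrete, below is a minimal PyTorch sketch. This is not the authors' implementation: the module names (ChannelFusionModule, SpatialFusionModule), the token shape (B, N, C) from a ViT backbone, and all layer choices (SE-style channel gating, multi-kernel depthwise convolutions standing in for "multilevel" spatial modeling, and a pointwise-depthwise-pointwise convolutional FFN) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class ChannelFusionModule(nn.Module):
    """CFM-like sketch: parallel joint channel enhancement and joint
    multilevel spatial modeling of RGB/TIR tokens, summed, then
    globally integrated with the original features (assumed design)."""

    def __init__(self, dim):
        super().__init__()
        # Joint channel enhancement over the concatenated modalities
        # (squeeze-and-excitation style gating; an assumption).
        self.channel_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim // 4), nn.ReLU(inplace=True),
            nn.Linear(dim // 4, 2 * dim), nn.Sigmoid())
        # Joint multilevel spatial modeling: depthwise 1-D convolutions
        # over the token sequence at several kernel sizes (stand-in).
        self.spatial_convs = nn.ModuleList(
            [nn.Conv1d(2 * dim, 2 * dim, k, padding=k // 2, groups=2 * dim)
             for k in (3, 5, 7)])
        # Global integration of the summed feature with the originals.
        self.integrate = nn.Linear(4 * dim, 2 * dim)

    def forward(self, rgb, tir):            # rgb, tir: (B, N, C) tokens
        x = torch.cat([rgb, tir], dim=-1)   # (B, N, 2C)
        gate = self.channel_mlp(x.mean(dim=1, keepdim=True))
        chan = x * gate                     # channel-enhancement branch
        xc = x.transpose(1, 2)              # (B, 2C, N) for convolutions
        spat = sum(conv(xc) for conv in self.spatial_convs).transpose(1, 2)
        fused = chan + spat                 # sum the two parallel branches
        out = self.integrate(torch.cat([fused, x], dim=-1))
        return out.chunk(2, dim=-1)         # back to (rgb, tir)


class SpatialFusionModule(nn.Module):
    """SFM-like sketch: cross-attention between the modalities, then a
    convolutional feedforward network for joint spatial and channel
    integration (assumed design)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_tir = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Convolutional FFN: pointwise -> depthwise -> pointwise.
        self.conv_ffn = nn.Sequential(
            nn.Conv1d(dim, 4 * dim, 1), nn.GELU(),
            nn.Conv1d(4 * dim, 4 * dim, 3, padding=1, groups=4 * dim),
            nn.Conv1d(4 * dim, dim, 1))

    def forward(self, rgb, tir):            # rgb, tir: (B, N, C) tokens
        # Each modality queries the other (cross-attention).
        r = rgb + self.attn_rgb(rgb, tir, tir, need_weights=False)[0]
        t = tir + self.attn_tir(tir, rgb, rgb, need_weights=False)[0]
        r = r + self.conv_ffn(r.transpose(1, 2)).transpose(1, 2)
        t = t + self.conv_ffn(t.transpose(1, 2)).transpose(1, 2)
        return r, t


# Usage: fuse ViT tokens from both modalities mid-backbone.
rgb = torch.randn(2, 256, 768)
tir = torch.randn(2, 256, 768)
rgb, tir = ChannelFusionModule(768)(rgb, tir)
rgb, tir = SpatialFusionModule(768)(rgb, tir)
```

Inserting such blocks between ViT stages lets the RGB and TIR token streams exchange information directly, without an intermediate fused representation, which is the direct-interaction design the abstract emphasizes.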

URL

https://arxiv.org/abs/2405.03177

PDF

https://arxiv.org/pdf/2405.03177.pdf
