HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion

2024-04-04 15:31:11
Jiahang Li, Peng Yun, Qijun Chen, Rui Fan

Abstract

Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, the majority of existing studies have relied on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between RGB and thermal modalities. Recent progress in vision foundation models (VFMs) trained through self-supervision on vast amounts of unlabeled data has proven their ability to extract informative, general-purpose features. However, this potential has yet to be fully leveraged in this domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of RGB and thermal modalities, thereby designing a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, demonstrates superior performance compared to all other state-of-the-art RGB-thermal scene parsing networks, achieving top ranks across three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm has opened up new opportunities for future developments in data-fusion scene parsing approaches.

URL

https://arxiv.org/abs/2404.03527

PDF

https://arxiv.org/pdf/2404.03527.pdf

