Quality-aware Selective Fusion Network for V-D-T Salient Object Detection

2024-05-13 11:32:05
Liuxin Bao, Xiaofei Zhou, Xiankai Lu, Yaoqi Sun, Haibing Yin, Zhenghui Hu, Jiyong Zhang, Chenggang Yan

Abstract

Depth images and thermal images carry spatial geometry information and surface temperature information, respectively, both of which can complement the RGB modality. However, the quality of depth and thermal images is often unreliable in challenging scenarios, which degrades the performance of two-modality salient object detection (SOD). Meanwhile, some researchers have turned to the triple-modal SOD task, which explores the complementarity of the RGB image, the depth image, and the thermal image. Existing triple-modal SOD methods, however, fail to perceive the quality of depth maps and thermal images, so their performance drops on scenes with low-quality depth and thermal images. Therefore, we propose a quality-aware selective fusion network (QSF-Net) for VDT salient object detection, which contains three subnets: the initial feature extraction subnet, the quality-aware region selection subnet, and the region-guided selective fusion subnet. Firstly, in addition to extracting features, the initial feature extraction subnet generates a preliminary prediction map from each modality via a shrinkage pyramid architecture. Then, we design the weakly-supervised quality-aware region selection subnet to generate the quality-aware maps. Concretely, we first find the high-quality and low-quality regions by using the preliminary predictions; these regions constitute the pseudo label used to train this subnet. Finally, the region-guided selective fusion subnet purifies the initial features under the guidance of the quality-aware maps, then fuses the triple-modal features through the intra-modality and inter-modality attention (IIA) module and refines the edge details of the prediction maps through the edge refinement (ER) module. Extensive experiments are performed on VDT-2048
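
The quality-aware selection and purification steps described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the pseudo label is built or how features are fused, so the agreement rule in `quality_pseudo_label`, the threshold `tol`, and the quality-weighted average in `fuse` (a simple stand-in for the IIA module) are all assumptions made for illustration, as are the function names.

```python
import numpy as np


def quality_pseudo_label(pred, gt, tol=0.5):
    """Hypothetical rule (an assumption, not from the paper): a pixel is
    'high quality' (1.0) when a modality's preliminary prediction is within
    `tol` of the ground-truth saliency value, and 'low quality' (0.0) otherwise."""
    pred = np.asarray(pred, dtype=np.float32)
    gt = np.asarray(gt, dtype=np.float32)
    return (np.abs(pred - gt) < tol).astype(np.float32)


def purify(features, quality_map):
    """Scale a (C, H, W) feature tensor by an (H, W) quality map, so that
    features in low-quality regions are suppressed."""
    return features * quality_map[None, :, :]


def fuse(purified_feats, quality_maps, eps=1e-6):
    """Quality-weighted average of already-purified per-modality features.
    A plain stand-in for attention-based fusion, used only for illustration."""
    numerator = np.sum(np.stack(purified_feats), axis=0)       # (C, H, W)
    denominator = np.sum(np.stack(quality_maps), axis=0) + eps  # (H, W)
    return numerator / denominator[None, :, :]


# Toy 2x2 example with two modalities.
gt = np.array([[1.0, 0.0], [0.0, 1.0]])
pred_rgb = np.array([[0.9, 0.1], [0.2, 0.8]])    # agrees with gt everywhere
pred_depth = np.array([[0.1, 0.1], [0.9, 0.8]])  # wrong in the left column

q_rgb = quality_pseudo_label(pred_rgb, gt)
q_depth = quality_pseudo_label(pred_depth, gt)

feat_rgb = np.full((2, 2, 2), 2.0, dtype=np.float32)
feat_depth = np.full((2, 2, 2), 4.0, dtype=np.float32)

fused = fuse(
    [purify(feat_rgb, q_rgb), purify(feat_depth, q_depth)],
    [q_rgb, q_depth],
)
```

In this toy case the depth prediction is wrong in the left column, so the fused features there come from the RGB branch alone, while the right column averages both modalities. In the paper, the quality-aware maps are predicted by a trained subnet rather than computed from the ground truth at inference time.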

URL

https://arxiv.org/abs/2405.07655

PDF

https://arxiv.org/pdf/2405.07655.pdf

