Abstract
Salient object detection (SOD) aims to find the most salient objects in images and output pixel-level binary masks. Transformer-based methods achieve promising performance due to their global semantic understanding, which is crucial for identifying salient objects. However, these models tend to be large and require many trainable parameters. To better harness the potential of transformers for SOD, we propose a novel parameter-efficient fine-tuning method that reduces the number of trainable parameters while enhancing salient object detection capability. Our model, termed EXternal Prompt features Enhanced adapteR Tuning (ExPert), features an encoder-decoder structure with adapters and injectors interspersed between the layers of a frozen transformer encoder. The adapter modules adapt the pre-trained backbone to SOD, while the injector modules incorporate external prompt features to enhance awareness of salient objects. Comprehensive experiments demonstrate the superiority of our method. Surpassing former state-of-the-art (SOTA) models across five SOD datasets, ExPert achieves a mean absolute error (MAE) of 0.215 on the ECSSD dataset with 80.2M trained parameters, 21% better than the transformer-based SOTA model and 47% better than the CNN-based SOTA model.
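The core idea of adapter-based parameter-efficient fine-tuning described above can be sketched as follows. This is an illustrative example, not the authors' code: it shows a generic bottleneck adapter (down-projection, nonlinearity, up-projection, residual) of the kind inserted between frozen encoder layers, with all dimensions and initializations chosen here for illustration.

```python
# Minimal sketch of a bottleneck adapter for parameter-efficient fine-tuning.
# NOT the ExPert implementation; dimensions and init are assumptions.
import numpy as np

class Adapter:
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.w_up = np.zeros((bottleneck, dim))  # zero-init: starts as identity

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU on the bottleneck features
        return x + h @ self.w_up              # residual keeps frozen features

rng = np.random.default_rng(0)
adapter = Adapter(dim=768, bottleneck=64, rng=rng)

# Tokens coming out of one frozen transformer encoder layer (shape assumed).
frozen_features = rng.standard_normal((196, 768))
out = adapter(frozen_features)
print(out.shape)  # (196, 768)

# Only the adapter weights are trained; the backbone stays frozen.
trainable = adapter.w_down.size + adapter.w_up.size
print(trainable)  # 98304, far fewer than a full 768-dim encoder layer
```

Because the up-projection is zero-initialized, the adapter is an identity map at the start of training, so fine-tuning begins exactly from the pre-trained backbone's behavior.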
URL
https://arxiv.org/abs/2404.15008