Abstract
Recently, data-driven deep saliency models have achieved high performance, outperforming classical saliency models on benchmarks such as MIT300 and SALICON. Yet a large gap remains between the performance of these models and the inter-human baseline. Outstanding questions include: what have these models learned, how and where do they fail, and how can they be improved? This article attempts to answer these questions by analyzing the representations learned by individual neurons in the intermediate layers of deep saliency models. To this end, we follow the recipe of existing deep saliency models: borrowing a model pre-trained for object recognition to encode visual features, and learning a decoder to infer saliency. We consider two cases, one where the encoder is used as a fixed feature extractor and one where it is fine-tuned, and compare the networks' inner representations. To study how the learned representations depend on the task, we fine-tune the same network on the same image set but for two different tasks: saliency prediction versus scene classification. Our analyses reveal that: 1) some visual regions (e.g. head, text, symbol, vehicle) are already encoded within various layers of the network pre-trained for object recognition; 2) on modern datasets, fine-tuning pre-trained models for saliency prediction makes them favor some categories (e.g. head) over others (e.g. text); 3) although deep saliency models outperform classical models on natural images, the converse is true for synthetic stimuli (e.g. pop-out search arrays), evidence of a significant difference between human and data-driven saliency models; and 4) we confirm that, after fine-tuning, the change in inner representations is mostly due to the task and not the domain shift in the data.
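The two training regimes contrasted in the abstract, using the pre-trained encoder as a fixed feature extractor versus fine-tuning it end to end for the saliency task, can be illustrated with a minimal, dependency-free sketch. All names here are hypothetical and the "layers" are scalar stand-ins; the point is only that freezing excludes encoder parameters from the update while the decoder is always trained.

```python
# Hypothetical sketch: a "parameter" is a value plus a trainable flag.
def make_param(value):
    return {"value": value, "trainable": True}

encoder = [make_param(0.5), make_param(-0.2)]  # stands in for pre-trained recognition layers
decoder = [make_param(0.1)]                    # saliency read-out, learned from scratch

def set_encoder_frozen(frozen):
    for p in encoder:
        p["trainable"] = not frozen

def sgd_step(params, grads, lr=0.1):
    # Update only trainable parameters; freezing vs. fine-tuning
    # is just which parameters carry the trainable flag.
    for p, g in zip(params, grads):
        if p["trainable"]:
            p["value"] -= lr * g

# Case 1: encoder as a fixed feature extractor (only the decoder moves).
set_encoder_frozen(True)
sgd_step(encoder + decoder, [1.0, 1.0, 1.0])
frozen_encoder_values = [p["value"] for p in encoder]  # unchanged

# Case 2: end-to-end fine-tuning for saliency (encoder moves too).
set_encoder_frozen(False)
sgd_step(encoder + decoder, [1.0, 1.0, 1.0])
```

In a real implementation the same switch is typically a per-parameter gradient flag on the pre-trained backbone, with the decoder trained in both cases.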
URL
https://arxiv.org/abs/1903.02501