Abstract
The transfer of a convolutional neural network (CNN) trained to recognize objects to the task of scene classification is considered. A Bag-of-Semantics (BoS) representation is first induced by feeding scene image patches to the object CNN and representing the scene image by the ensuing bag of posterior class probability vectors (semantic posteriors). The encoding of the BoS with a Fisher vector (FV) is then studied. A link is established between the FV of any probabilistic model and the Q-function of the expectation-maximization (EM) algorithm used to estimate its parameters by maximum likelihood. A network implementation of the mixture of factor analyzers (MFA) Fisher score (MFA-FS), denoted the MFAFSNet, is finally proposed to enable end-to-end training. Experiments with various object CNNs and datasets show that the approach has state-of-the-art transfer performance. Somewhat surprisingly, the scene classification results are superior to those of a CNN explicitly trained for scene classification on a large scene dataset (Places). This suggests that holistic analysis is insufficient for scene classification; the modeling of local object semantics appears to be at least equally important. The two approaches are also shown to be strongly complementary, leading to very large scene classification gains when combined and outperforming all previous scene classification approaches by a sizeable margin.
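As a rough illustration of the BoS construction described in the abstract, the sketch below assumes the object CNN has already produced a vector of class scores (logits) for each image patch and simply converts each one into a semantic posterior. The function names `softmax` and `bag_of_semantics` and the numeric values are hypothetical and not taken from the paper; the Fisher-vector encoding and the MFAFSNet are not covered here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: maps class scores to a probability vector."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bag_of_semantics(patch_logits: List[List[float]]) -> List[List[float]]:
    """Turn per-patch class scores into a bag of semantic posteriors.

    Each inner list is assumed to hold the object-CNN scores for one patch;
    the result is one posterior class probability vector per patch.
    """
    return [softmax(row) for row in patch_logits]


# Made-up scores for 3 patches over 4 object classes (illustrative only).
patch_logits = [
    [2.0, 0.5, -1.0, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [-3.0, 4.0, 1.0, 0.0],
]
bos = bag_of_semantics(patch_logits)
```

Each row of `bos` lies on the probability simplex (non-negative entries that sum to one), and the scene image is represented by the whole collection of rows rather than by a single holistic prediction.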
URL
https://arxiv.org/abs/1905.11539