Paper Reading AI Learner

Semantic Fisher Scores for Task Transfer: Using Objects to Classify Scenes

2019-05-27 23:15:26
Mandar Dixit, Yunsheng Li, Nuno Vasconcelos

Abstract

The transfer of a neural network (CNN) trained to recognize objects to the task of scene classification is considered. A Bag-of-Semantics (BoS) representation is first induced by feeding scene image patches to the object CNN and representing the scene image by the ensuing bag of posterior class-probability vectors (semantic posteriors). The encoding of the BoS with a Fisher vector (FV) is then studied. A link is established between the FV of any probabilistic model and the Q-function of the expectation-maximization (EM) algorithm used to estimate its parameters by maximum likelihood. A network implementation of the mixture of factor analyzers Fisher score (MFA-FS), denoted the MFAFSNet, is finally proposed to enable end-to-end training. Experiments with various object CNNs and datasets show that the approach achieves state-of-the-art transfer performance. Somewhat surprisingly, the scene classification results are superior to those of a CNN trained explicitly for scene classification on a large scene dataset (Places). This suggests that holistic analysis is insufficient for scene classification; the modeling of local object semantics appears to be at least equally important. The two approaches are also shown to be strongly complementary, leading to very large scene classification gains when combined and outperforming all previous scene classification approaches by a sizeable margin.
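
The BoS construction described above is easy to prototype. The following is a minimal sketch, assuming a torchvision ImageNet classifier and a simple overlapping grid of patches; the patch size, stride, and model choice are illustrative assumptions, not the paper's exact settings, and input normalization is omitted for brevity.

import torch
import torch.nn.functional as F
import torchvision.models as models

# Pretrained object CNN; its 1000-way softmax supplies the "semantic posteriors".
cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

def bag_of_semantics(image, patch=224, stride=112):
    """image: float tensor of shape (C, H, W), already normalized.
    Returns the bag of posterior class-probability vectors, one per patch."""
    # Carve the image into an overlapping grid of patches.
    p = image.unfold(1, patch, stride).unfold(2, patch, stride)    # (C, nH, nW, patch, patch)
    p = p.permute(1, 2, 0, 3, 4).reshape(-1, image.shape[0], patch, patch)
    with torch.no_grad():
        logits = cnn(p)                    # one logit vector per patch
    return F.softmax(logits, dim=1)        # (n_patches, 1000) semantic posteriors

In the paper this bag is not simply averaged but encoded with a Fisher vector; the MFA-FS is the Fisher score of a mixture of factor analyzers fitted to these posteriors.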
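
The FV/EM link stated in the abstract is, at its core, Fisher's identity; the short derivation below uses standard EM notation and is not copied from the paper. For a latent-variable model $p(x;\theta) = \sum_z p(x,z;\theta)$, the EM Q-function is

\[ Q(\theta'; \theta) = \mathbb{E}_{z \sim p(z \mid x; \theta)}\big[\log p(x, z; \theta')\big], \]

and differentiating the log-likelihood gives

\[ \nabla_\theta \log p(x;\theta) = \frac{1}{p(x;\theta)} \sum_z \nabla_\theta p(x,z;\theta) = \sum_z p(z \mid x;\theta)\, \nabla_\theta \log p(x,z;\theta) = \nabla_{\theta'} Q(\theta';\theta)\big|_{\theta'=\theta}. \]

The Fisher score of any model that admits an EM algorithm can therefore be read off its Q-function, which is what allows FVs (and efficient EM implementations) to be derived for models such as the MFA.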

URL

https://arxiv.org/abs/1905.11539

PDF

https://arxiv.org/pdf/1905.11539.pdf

