Combination of Multiple Global Descriptors for Image Retrieval

2019-03-26 03:38:38
HeeJae Jun, ByungSoo Ko, Youngjoon Kim, Insik Kim, Jongtack Kim

Abstract

Recent studies on image retrieval have shown that ensembling different models and combining multiple global descriptors improve performance. However, training separate models for an ensemble is not only difficult but also inefficient with respect to time or memory. In this paper, we propose a novel framework that exploits multiple global descriptors to obtain an ensemble-like effect while remaining trainable in an end-to-end manner. The proposed framework is flexible and extensible with respect to the global descriptor, CNN backbone, loss, and dataset. Moreover, we investigate the effectiveness of combining multiple global descriptors through quantitative and qualitative analysis. Our extensive experiments show that the combined descriptor outperforms a single global descriptor because it can exploit different types of feature properties. In the benchmark evaluation, the proposed framework achieves state-of-the-art image retrieval performance on CARS196, CUB200-2011, In-shop Clothes, and Stanford Online Products, by a large margin over competing approaches.
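
The idea of combining several global descriptors of one feature map can be illustrated with a minimal sketch. The abstract does not name the descriptors, so the three poolings below (average, max, and generalized-mean with exponent p = 3) are illustrative assumptions, as are the helper names and the toy feature map. Any learned layers, the loss, and training are omitted; this only shows how per-descriptor normalization followed by concatenation yields a single combined vector.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def avg_pool(feature_map):
    """Average pooling: mean activation per channel."""
    return [sum(ch) / len(ch) for ch in feature_map]


def max_pool(feature_map):
    """Max pooling: strongest activation per channel."""
    return [max(ch) for ch in feature_map]


def gem_pool(feature_map, p=3.0):
    """Generalized-mean pooling; assumes non-negative activations (e.g. post-ReLU)."""
    return [(sum(v ** p for v in ch) / len(ch)) ** (1.0 / p) for ch in feature_map]


def combine_descriptors(feature_map, poolers):
    """Normalize each global descriptor, concatenate them, and normalize the result."""
    combined = []
    for pool in poolers:
        combined.extend(l2_normalize(pool(feature_map)))
    return l2_normalize(combined)


# Hypothetical toy feature map: 3 channels, each with 4 flattened spatial activations.
feature_map = [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]

descriptor = combine_descriptors(feature_map, [avg_pool, max_pool, gem_pool])
```

With three channels and three poolings, `descriptor` has nine entries and unit length, so it can be compared to other descriptors with a dot product. Normalizing each part before concatenation keeps any single pooling from dominating the combined vector purely because of its scale.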

URL

https://arxiv.org/abs/1903.10663

PDF

https://arxiv.org/pdf/1903.10663.pdf

