Paper Reading AI Learner

Semantic Adversarial Network for Zero-Shot Sketch-Based Image Retrieval

2019-05-07 02:20:42
Xinxun Xu, Hao Wang, Leida Li, Cheng Deng

Abstract

Zero-shot sketch-based image retrieval (ZS-SBIR) is a specific cross-modal retrieval task for retrieving natural images with free-hand sketches under a zero-shot scenario. Previous works mostly focus on modeling the correspondence between images and sketches or synthesizing image features from sketch features. However, both of them ignore the large intra-class variance of sketches, resulting in unsatisfactory retrieval performance. In this paper, we propose a novel end-to-end semantic adversarial approach for ZS-SBIR. Specifically, we devise a semantic adversarial module to maximize the consistency between learned semantic features and category-level word vectors. Moreover, to preserve the discriminability of synthesized features within each training category, a triplet loss is employed for the generative module. Additionally, the proposed model is trained in an end-to-end manner to learn semantic features better suited to ZS-SBIR. Extensive experiments conducted on two large-scale popular datasets demonstrate that our proposed approach remarkably outperforms state-of-the-art approaches in retrieval by more than 12% on the Sketchy dataset and about 3% on the TU-Berlin dataset.
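The abstract describes two training objectives: a triplet loss that keeps synthesized features discriminative within each training category, and a semantic-adversarial term that pushes learned semantic features toward category-level word vectors. A minimal sketch of both terms, assuming Euclidean-distance triplets and cosine similarity to the word vectors (the function names, the margin value, and the use of a plain cosine-consistency term in place of the full adversarial module are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss: max(0, d(a,p) - d(a,n) + margin),
    with Euclidean distances. Keeps same-class (anchor, positive)
    features closer than cross-class (anchor, negative) features."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

def semantic_consistency_loss(features, word_vectors):
    """1 - mean cosine similarity between learned semantic features
    and their category-level word vectors (one row per sample).
    Zero when every feature is perfectly aligned with its word vector."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(f * w, axis=1)))
```

In the paper's setting these two terms would be combined (together with the generative objective) into a single end-to-end loss; the weighting between them is a tunable hyperparameter not specified here.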


URL

https://arxiv.org/abs/1905.02327

PDF

https://arxiv.org/pdf/1905.02327.pdf

