
Semantically Tied Paired Cycle Consistency for Zero-Shot Sketch-based Image Retrieval

2019-03-08 11:20:39
Anjan Dutta, Zeynep Akata

Abstract

Zero-shot sketch-based image retrieval (SBIR) is an emerging task in computer vision that retrieves natural images relevant to sketch queries whose categories may not have been seen during training. Existing works either require aligned sketch-image pairs or an inefficient memory fusion layer for mapping the visual information to a semantic space. In this work, we propose a semantically aligned paired cycle-consistent generative (SEM-PCYC) model for zero-shot SBIR, where each branch maps the visual information to a common semantic space via adversarial training. Each of these branches maintains a cycle consistency that only requires supervision at the category level, avoiding the need for costly aligned sketch-image pairs. A classification criterion on the generators' outputs ensures that the visual-to-semantic mapping is discriminative. Furthermore, we propose to combine textual and hierarchical side information via a feature-selection auto-encoder that selects discriminative side information within the same end-to-end model. Our results demonstrate a significant boost in zero-shot SBIR performance over the state of the art on the challenging Sketchy and TU-Berlin datasets.
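The abstract describes a per-branch loss composition: an adversarial term pushing visual features into a common semantic space, a cycle-consistency term that only needs category-level supervision, and a classification term keeping the mapping discriminative. Below is a minimal, hypothetical PyTorch sketch of how such a branch could be wired together. The module shapes, loss weights (lambda_cyc, lambda_cls), and discriminator are illustrative assumptions based solely on the abstract, not the authors' implementation; consult the paper at the URL below for the actual SEM-PCYC model.

```python
# Hypothetical sketch of one SEM-PCYC branch as described in the abstract.
# Dimensions, weights, and network shapes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=1024):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class SemPcycBranch(nn.Module):
    """One branch (sketch or image): visual -> semantic -> visual cycle."""
    def __init__(self, vis_dim=512, sem_dim=300, n_classes=100):
        super().__init__()
        self.forward_gen = mlp(vis_dim, sem_dim)   # visual -> semantic space
        self.backward_gen = mlp(sem_dim, vis_dim)  # semantic -> visual space
        self.classifier = nn.Linear(sem_dim, n_classes)

    def forward(self, x):
        sem = self.forward_gen(x)      # embedding in the common semantic space
        rec = self.backward_gen(sem)   # cycle back to the visual space
        logits = self.classifier(sem)  # category-level supervision
        return sem, rec, logits

def branch_losses(branch, disc, x, labels, lambda_cyc=1.0, lambda_cls=1.0):
    sem, rec, logits = branch(x)
    # Adversarial term: the discriminator should score the generated
    # semantic embedding as genuine side information.
    adv = F.binary_cross_entropy_with_logits(
        disc(sem), torch.ones(x.size(0), 1))
    # Cycle consistency: reconstruct the visual features, so no aligned
    # sketch-image pairs are required, only category labels.
    cyc = F.l1_loss(rec, x)
    # Classification criterion keeps the semantic mapping discriminative.
    cls = F.cross_entropy(logits, labels)
    return adv + lambda_cyc * cyc + lambda_cls * cls

# Usage (random data, purely illustrative):
sketch_branch = SemPcycBranch()
disc = mlp(300, 1)  # semantic-space discriminator
x = torch.randn(8, 512)
y = torch.randint(0, 100, (8,))
loss = branch_losses(sketch_branch, disc, x, y)
loss.backward()
```

In the full model the abstract implies two such branches (sketch and image) sharing the common semantic space, with the side-information feature-selection auto-encoder providing the target embeddings the discriminator compares against.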

URL

https://arxiv.org/abs/1903.03372

PDF

https://arxiv.org/pdf/1903.03372.pdf

