CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

2023-03-23 17:02:00
Aneeshan Sain, Ayan Kumar Bhunia, Pinaki Nath Chowdhury, Subhadeep Koley, Tao Xiang, Yi-Zhe Song

Abstract

In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, and for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that outperforms all prior art by a large margin (24.8%), a strong testimony to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over the previous state of the art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.
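
The abstract names two building blocks, a standalone triplet loss and patch shuffling, without giving their details. The sketch below is only one plausible reading, not the authors' implementation: it assumes square images stored as nested lists, a hypothetical `grid` x `grid` patch layout, and a hypothetical margin of 0.2. The function names `shuffle_patches` and `triplet_loss` are illustrative. Applying the same permutation to a sketch and its photo keeps their patches aligned, which is the kind of structural correspondence the abstract alludes to.

```python
import random


def shuffle_patches(image, grid, permutation):
    """Reorder the grid x grid patches of a square 2D image.

    Patches are indexed row-major; destination patch k is filled with
    source patch permutation[k]. All names here are illustrative assumptions.
    """
    size = len(image)
    if size % grid != 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square and divisible by grid")
    if sorted(permutation) != list(range(grid * grid)):
        raise ValueError("permutation must cover every patch exactly once")

    patch = size // grid
    out = [[None] * size for _ in range(size)]
    for dst, src in enumerate(permutation):
        src_row, src_col = divmod(src, grid)
        dst_row, dst_col = divmod(dst, grid)
        for i in range(patch):
            for j in range(patch):
                out[dst_row * patch + i][dst_col * patch + j] = (
                    image[src_row * patch + i][src_col * patch + j]
                )
    return out


def triplet_loss(d_pos, d_neg, margin=0.2):
    """Standard triplet hinge: push the negative at least `margin` further
    away than the positive. The margin value is an assumption."""
    return max(0.0, margin + d_pos - d_neg)


if __name__ == "__main__":
    sketch = [[r * 4 + c for c in range(4)] for r in range(4)]
    photo = [[100 + r * 4 + c for c in range(4)] for r in range(4)]

    # One shared permutation keeps sketch and photo patches in correspondence.
    perm = random.Random(0).sample(range(4), 4)
    shuffled_sketch = shuffle_patches(sketch, 2, perm)
    shuffled_photo = shuffle_patches(photo, 2, perm)
    print(perm)
    print(shuffled_sketch)
    print(shuffled_photo)
```

In a training loop, the distances passed to `triplet_loss` would come from encoder features of the (shuffled) sketch, its matching photo, and a non-matching photo; the regularisation loss mentioned in the abstract is not sketched here because its form is not stated.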

URL

https://arxiv.org/abs/2303.13440

PDF

https://arxiv.org/pdf/2303.13440.pdf
