Doodle to Search: Practical Zero-Shot Sketch-based Image Retrieval

Abstract

In this paper, we investigate the problem of zero-shot sketch-based image retrieval (ZS-SBIR), where human sketches are used as queries to retrieve photos from unseen categories. We advance prior art by proposing a novel ZS-SBIR scenario that represents a firm step forward in its practical application. The new setting recognizes two important yet often neglected challenges of practical ZS-SBIR: (i) the large domain gap between amateur sketches and photos, and (ii) the necessity of moving towards large-scale retrieval. We first contribute to the community a novel ZS-SBIR dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000 photos spanning 110 categories. Highly abstract amateur human sketches are purposefully sourced to maximize the domain gap, in contrast to the sketches in existing datasets, which can often be semi-photorealistic. We then formulate a ZS-SBIR framework that jointly models sketches and photos in a common embedding space. A novel strategy to mine the mutual information among domains is specifically engineered to alleviate the domain gap. External semantic knowledge is further embedded to aid semantic transfer. We show that, rather surprisingly, retrieval performance that significantly outperforms the state of the art on existing datasets can already be achieved using a reduced version of our model. We further demonstrate the superior performance of our full model by comparing it with a number of alternatives on the newly proposed dataset. The new dataset, plus all training and testing code of our model, will be publicly released to facilitate future research.
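
As a rough illustration of what retrieval in a common embedding space involves, the sketch below ranks photo embeddings by cosine similarity to each sketch embedding. It is a minimal example under stated assumptions, not the paper's model: the embeddings are random toy vectors standing in for the outputs of trained sketch and photo encoders, and the names `l2_normalize` and `retrieve`, the embedding size of 64, and the choice of cosine similarity are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so that dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve(sketch_emb, photo_emb, top_k=5):
    """Return indices and scores of the top_k photos for each sketch query.

    Both inputs are assumed to already live in the same embedding space
    (shape: [num_items, dim]); how they get there is the model's job.
    """
    s = l2_normalize(sketch_emb)
    p = l2_normalize(photo_emb)
    sim = s @ p.T                                   # [num_sketches, num_photos]
    order = np.argsort(-sim, axis=1)[:, :top_k]     # highest similarity first
    scores = np.take_along_axis(sim, order, axis=1)
    return order, scores


# Toy data (assumption): 1000 random "photo" embeddings, and two "sketch"
# queries built as slightly perturbed copies of photos 3 and 42.
rng = np.random.default_rng(0)
photos = rng.normal(size=(1000, 64))
sketches = photos[[3, 42]] + 0.01 * rng.normal(size=(2, 64))

idx, scores = retrieve(sketches, photos, top_k=3)
print(idx[:, 0])  # index of the best-matching photo for each query
```

In a real zero-shot setting the gallery would contain photos from categories never seen during training, so ranking quality depends entirely on how well the learned space aligns the two domains.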

Abstract (translated)

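This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), in which human sketches serve as queries for retrieving photos from unseen categories. We advance the existing art by proposing a novel ZS-SBIR scenario that marks a solid step towards practical application. The new setting recognizes two important but often overlooked challenges of practical ZS-SBIR: (i) the large domain gap between amateur sketches and photos, and (ii) the need to move towards large-scale retrieval. We first contribute to the community a novel ZS-SBIR dataset, QuickDraw-Extended, which contains 330,000 sketches and 204,000 photos spanning 110 categories. Highly abstract amateur human sketches are deliberately sourced to maximize the domain gap, unlike the sketches in existing datasets, which are often semi-photorealistic. We then design a ZS-SBIR framework that jointly models sketches and photos in a common embedding space. A new strategy for mining mutual information across domains is proposed to reduce the domain gap. External semantic knowledge is further embedded to support semantic transfer. We find that retrieval performance significantly better than the state of the art on existing datasets can already be achieved with a reduced version of our model. By comparing against a number of alternatives on the newly proposed dataset, we further demonstrate the superior performance of the full model. The new dataset, together with all training and testing code for our model, will be publicly released to facilitate future research.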

URL

https://arxiv.org/abs/1904.03451

PDF

https://arxiv.org/pdf/1904.03451.pdf