Back To The Drawing Board: Rethinking Scene-Level Sketch-Based Image Retrieval

2025-09-08 11:26:40
Emil Demić, Luka Čehovin Zajc

Abstract

The goal of Scene-level Sketch-Based Image Retrieval (SBIR) is to retrieve natural images matching the overall semantics and spatial layout of a free-hand sketch. Unlike prior work focused on architectural augmentations of retrieval models, we emphasize the inherent ambiguity and noise present in real-world sketches. This insight motivates a training objective that is explicitly designed to be robust to sketch variability. We show that with an appropriate combination of pre-training, encoder architecture, and loss formulation, it is possible to achieve state-of-the-art performance without introducing additional complexity. Extensive experiments on the challenging FS-COCO and the widely used SketchyCOCO datasets confirm the effectiveness of our approach and underline the critical role of training design in cross-modal retrieval tasks, as well as the need to improve the evaluation scenarios of scene-level SBIR.

URL

https://arxiv.org/abs/2509.06566

PDF

https://arxiv.org/pdf/2509.06566.pdf

