
SketchTriplet: Self-Supervised Scenarized Sketch-Text-Image Triplet Generation

2024-05-29 06:43:49
Zhenbei Wu, Qiang Wang, Jie Yang


The scarcity of free-hand sketches presents a challenging problem. Despite the emergence of some large-scale sketch datasets, these datasets primarily consist of sketches at the single-object level. There continues to be a lack of large-scale paired datasets for scene sketches. In this paper, we propose a self-supervised method for scene sketch generation that does not rely on any existing scene sketches, enabling the transformation of single-object sketches into scene sketches. To accomplish this, we introduce a method for vector sketch captioning and sketch semantic expansion. Additionally, we design a sketch generation network that fuses multi-modal perceptual constraints and is suitable for zero-shot image-to-sketch downstream tasks, where experiments demonstrate state-of-the-art performance. Finally, leveraging our proposed sketch-to-sketch generation method, we contribute a large-scale dataset centered around scene sketches, comprising highly semantically consistent "text-sketch-image" triplets. Our research confirms that this dataset can significantly enhance the capabilities of existing models in sketch-based image retrieval and sketch-controlled image synthesis tasks. We will make our dataset and code publicly available.



