One-shot Face Sketch Synthesis in the Wild via Generative Diffusion Prior and Instruction Tuning

2025-06-18 09:41:30
Han Wu, Junyao Li, Kangbo Zhao, Sen Zhang, Yukai Shi, Liang Lin

Abstract

Face sketch synthesis converts face photos into sketches. Existing face sketch synthesis research mainly relies on training with large numbers of photo-sketch pairs from existing datasets. However, these large-scale discriminative learning methods face problems such as data scarcity and high human labor costs, and their generative performance degrades significantly once training data becomes scarce. In this paper, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo-sketch image pairs, and then use the instructions derived through gradient-based optimization for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 face photo-sketch image pairs, covering sketches in different styles and photos with different backgrounds, ages, sexes, expressions, illumination, etc. For a solid out-of-distribution evaluation, we select only one image pair for training each time and use the rest for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot context. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at: this https URL
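
The evaluation protocol described in the abstract (one pair for training, the remaining pairs for inference) can be illustrated with a minimal sketch. This is not the authors' code: the function name `one_shot_splits` and the file names are hypothetical, and the assumption that every pair takes a turn as the training pair is ours, since the abstract only states that one pair is selected each time.

```python
from typing import Iterator, List, Tuple

# A (photo_path, sketch_path) pair; the paths used below are hypothetical.
Pair = Tuple[str, str]


def one_shot_splits(pairs: List[Pair]) -> Iterator[Tuple[Pair, List[Pair]]]:
    """Yield (train_pair, test_pairs) with exactly one pair held in for training.

    Assumption: each pair is used as the single training pair in turn.
    """
    for i, train_pair in enumerate(pairs):
        # Everything except the chosen pair is reserved for inference.
        test_pairs = pairs[:i] + pairs[i + 1:]
        yield train_pair, test_pairs


# Placeholder names standing in for the 400 OS-Sketch pairs.
pairs = [(f"photo_{i:03d}.png", f"sketch_{i:03d}.png") for i in range(400)]
splits = list(one_shot_splits(pairs))
```

Under this assumption, each split trains on one pair and evaluates on the other 399, so the training pair never appears among the inference inputs.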

URL

https://arxiv.org/abs/2506.15312

PDF

https://arxiv.org/pdf/2506.15312.pdf

