Paper Reading AI Learner

KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

2024-10-02 14:33:12
Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, Rajiv Ramnath

Abstract

Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user's specific needs. This mechanism lets KnobGen flexibly generate images from both novice sketches and those drawn by seasoned artists, maintaining control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.
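
The abstract does not spell out how the knob combines the two pathways. Below is a minimal sketch, assuming the knob acts as a scalar interpolation weight over the two controllers' conditioning features; the function name, tensor shapes, and linear blend are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def blend_control_signals(cgc_feat: torch.Tensor,
                          fgc_feat: torch.Tensor,
                          knob: float) -> torch.Tensor:
    """Interpolate between coarse (CGC) and fine (FGC) conditioning features.

    knob = 0.0 -> rely entirely on the coarse-grained pathway (high-level
    semantics); knob = 1.0 -> rely entirely on the fine-grained pathway
    (precise sketch following). Values in between trade one off for the other.
    NOTE: hypothetical blending rule for illustration only.
    """
    knob = max(0.0, min(1.0, knob))  # clamp the user-chosen knob to [0, 1]
    return (1.0 - knob) * cgc_feat + knob * fgc_feat

# Toy usage with feature maps shaped like a U-Net conditioning tensor.
cgc = torch.randn(1, 320, 64, 64)  # stand-in for a CGC output
fgc = torch.randn(1, 320, 64, 64)  # stand-in for an FGC output
novice_cond = blend_control_signals(cgc, fgc, knob=0.2)  # favor semantics
artist_cond = blend_control_signals(cgc, fgc, knob=0.9)  # favor precision
```

A linear blend is only one plausible reading; the paper applies its knob at diffusion inference time, and the point of the sketch is simply the adjustable trade-off between coarse semantics and fine detail.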

URL

https://arxiv.org/abs/2410.01595

PDF

https://arxiv.org/pdf/2410.01595.pdf

