Stylus: Automatic Adapter Selection for Diffusion Models

2024-04-29 17:59:16
Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph E. Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica

Abstract

Beyond scaling base models with more data or parameters, fine-tuned adapters provide an alternative way to generate high-fidelity, custom images at reduced costs. As such, adapters have been widely adopted by open-source communities, accumulating a database of over 100K adapters, most of which are highly customized with insufficient descriptions. This paper explores the problem of matching a prompt to a set of relevant adapters, building on recent work that highlights the performance gains of composing adapters. We introduce Stylus, which efficiently selects and automatically composes task-specific adapters based on a prompt's keywords. Stylus outlines a three-stage approach: it first summarizes adapters with improved descriptions and embeddings, then retrieves relevant adapters, and finally composes adapters based on the prompt's keywords by checking how well each fits the prompt. To evaluate Stylus, we developed StylusDocs, a curated dataset featuring 75K adapters with pre-computed adapter embeddings. In our evaluation on popular Stable Diffusion checkpoints, Stylus achieves greater CLIP/FID Pareto efficiency and is twice as preferred over the base model, with humans and multimodal models as evaluators. See the paper's project page for more.
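
The retrieve-and-compose stages lend themselves to a compact illustration. Below is a minimal sketch assuming pre-computed adapter embeddings, as in StylusDocs; the names (AdapterRecord, embed, top_k, per_keyword) and the cosine-similarity scoring are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of the retrieve-and-compose stages. All names and the
# cosine-similarity scoring are illustrative stand-ins, not Stylus's code.
from dataclasses import dataclass
import numpy as np

@dataclass
class AdapterRecord:
    name: str
    description: str       # improved description from the "refine" stage
    embedding: np.ndarray  # pre-computed embedding, as in StylusDocs

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(prompt_emb: np.ndarray, adapters: list[AdapterRecord],
             top_k: int = 20) -> list[AdapterRecord]:
    """Stage 2: rank all adapters by similarity to the whole prompt."""
    return sorted(adapters, key=lambda r: cosine(prompt_emb, r.embedding),
                  reverse=True)[:top_k]

def compose(keywords: list[str], candidates: list[AdapterRecord],
            embed, per_keyword: int = 2) -> dict[str, list[str]]:
    """Stage 3: assign each keyword the retrieved adapters that fit it best."""
    picks = {}
    for kw in keywords:
        kw_emb = embed(kw)
        ranked = sorted(candidates, key=lambda r: cosine(kw_emb, r.embedding),
                        reverse=True)
        picks[kw] = [r.name for r in ranked[:per_keyword]]
    return picks

# Toy usage with a deterministic stand-in encoder (a real system would use
# a text-embedding model over the 75K StylusDocs records).
embed = lambda t: np.random.default_rng(abs(hash(t)) % 2**32).standard_normal(384)
adapters = [AdapterRecord(f"lora-{i}", f"adapter {i}", embed(f"adapter {i}"))
            for i in range(100)]
shortlist = retrieve(embed("cyberpunk portrait, neon lights"), adapters)
print(compose(["cyberpunk", "portrait"], shortlist, embed))
```

Keeping the per-keyword assignment separate from the global top-k shortlist mirrors the paper's point that adapters should match specific keywords in the prompt, not just the prompt as a whole.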

URL

https://arxiv.org/abs/2404.18928

PDF

https://arxiv.org/pdf/2404.18928.pdf
