Paper Reading AI Learner

Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters

2023-11-30 18:04:21
James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin

Abstract

Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential (i.e., continual) manner while only providing a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-ranked attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low-rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extend our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance on this standard benchmark.
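To make the mechanism the abstract describes concrete, here is a minimal PyTorch sketch: a frozen linear layer augmented with a LoRA update that is gated element-wise by a hard binary mask, where the mask logits come from a low-rank factorization (standing in for the paper's low-rank MLP parameterization) and a straight-through estimator keeps the hard mask trainable. The names (MaskedLoRALinear, rank, mask_rank) and the exact mask parameterization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MaskedLoRALinear(nn.Module):
    """Frozen linear layer plus a hard-masked low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 4, mask_rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weights frozen
        out_f, in_f = base.weight.shape
        # Standard LoRA factors: the weight update is B @ A.
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        # Low-rank factors producing per-weight mask logits (a stand-in
        # for the paper's low-rank MLP that parameterizes the mask).
        self.mask_u = nn.Parameter(torch.randn(out_f, mask_rank) * 0.1)
        self.mask_v = nn.Parameter(torch.randn(mask_rank, in_f) * 0.1)

    def hard_mask(self) -> torch.Tensor:
        logits = self.mask_u @ self.mask_v   # (out_f, in_f) mask logits
        soft = torch.sigmoid(logits)
        hard = (soft > 0.5).float()          # binary mask
        # Straight-through estimator: hard values in the forward pass,
        # gradients flow through the soft sigmoid in the backward pass.
        return hard + soft - soft.detach()

    def delta_weight(self) -> torch.Tensor:
        # Sparse low-rank update: mask applied element-wise to B @ A.
        return self.hard_mask() * (self.B @ self.A)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.delta_weight().t()
```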

Abstract (translated)

Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple fine-grained concepts in a sequential (i.e., continual) manner, while providing only a few example images per concept. This setting is known as continual diffusion. Here, we ask a question: can we scale these methods to longer concept sequences without forgetting? Although prior research mitigates the forgetting of previously learned concepts, we show that the capacity to learn new tasks saturates over longer sequences. To address this problem, we introduce a novel method, STack-And-Mask INcremental Adapters (STAMINA), composed of low-rank attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robustness of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low-rank MLPs, enabling precise, scalable learning through sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, incurring no additional inference parameter cost. We evaluate STAMINA on a text-to-image continual customization setting with 50 concepts and find that it surpasses the prior SOTA. Furthermore, we extend our method to the setting of continual learning for image classification, demonstrating that our gains also carry over to this standard benchmark.
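Both abstracts note that all trainable parameters can be folded back into the model after training. This follows from the update being purely additive: the masked low-rank delta can be merged into the frozen weight, leaving a plain linear layer with no extra inference parameters. A hedged sketch, reusing the hypothetical MaskedLoRALinear class from the snippet above:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def fold_into_base(layer: MaskedLoRALinear) -> nn.Linear:
    """Merge the trained masked LoRA delta into a plain nn.Linear."""
    merged = nn.Linear(
        layer.base.in_features,
        layer.base.out_features,
        bias=layer.base.bias is not None,
    )
    merged.weight.copy_(layer.base.weight + layer.delta_weight())
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged


# Usage: merged = fold_into_base(masked_layer) gives a drop-in
# replacement at inference time with no additional parameters.
```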

URL

https://arxiv.org/abs/2311.18763

PDF

https://arxiv.org/pdf/2311.18763.pdf

