Paper Reading AI Learner

LinFusion: 1 GPU, 1 Minute, 16K Image

2024-09-03 17:54:39
Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang

Abstract

Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we propose a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save training cost and better leverage pre-trained models, we initialize our models from and distill knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, producing high-resolution images at up to 16K resolution. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Code is available at this https URL.
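To make the complexity contrast concrete, the sketch below compares standard softmax self-attention, which materializes an n×n weight matrix, with a normalized, non-causal linear attention, where associativity lets the key-value summary be computed once in O(n) time and memory. This is an illustrative NumPy sketch, not LinFusion's actual module; the feature map `phi` is a hypothetical positive kernel chosen for demonstration, not the paper's exact parameterization.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard self-attention: builds an (n, n) weight matrix,
    # so time and memory grow quadratically with token count n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Non-causal linear attention with normalization.
    # phi is an illustrative positive feature map (an assumption here).
    # Associativity: phi(Q) @ (phi(K)^T V) replaces (phi(Q) phi(K)^T) @ V,
    # so no (n, n) matrix is ever formed -- cost is linear in n.
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                 # (d, d) summary, independent of n
    z = Qf @ Kf.sum(axis=0)       # per-query normalization term
    return (Qf @ kv) / z[:, None]
```

Because `phi` is positive and the output is normalized by `z`, each output row remains a convex combination of value rows, mirroring the "attention normalization" property the abstract highlights; the non-causal form lets every spatial token attend to every other, unlike the causal recurrences in Mamba-style models.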

URL

https://arxiv.org/abs/2409.02097

PDF

https://arxiv.org/pdf/2409.02097.pdf

