Abstract
Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to model complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we propose a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba, Mamba2, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To reduce training cost and better leverage pre-trained models, we initialize our models from pre-trained StableDiffusion (SD) and distill its knowledge into them. We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images at resolutions up to 16K. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Code is available at this https URL.
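The abstract highlights attention normalization and non-causal inference as the two ingredients that make linear attention work well for high-resolution generation. The sketch below is a minimal, generic PyTorch illustration of a normalized, non-causal linear attention layer, not the exact LinFusion operator; the elu+1 feature map, the function name, and tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Illustrative non-causal linear attention with normalization.

    q, k, v: tensors of shape (batch, tokens, dim).
    Complexity is O(N * d^2) in the number of tokens N, versus O(N^2 * d)
    for softmax attention, because keys and values are aggregated once
    over all tokens instead of forming an N x N attention matrix.
    """
    # Non-negative feature maps stand in for the softmax kernel (an assumption;
    # the actual kernel in LinFusion may differ).
    q = F.elu(q) + 1.0                            # (B, N, d)
    k = F.elu(k) + 1.0                            # (B, N, d)

    # Non-causal aggregation: every query attends to all tokens at once.
    kv = torch.einsum("bnd,bne->bde", k, v)       # (B, d, d)
    k_sum = k.sum(dim=1)                          # (B, d)

    # Attention normalization: divide each output by its total attention weight.
    z = torch.einsum("bnd,bd->bn", q, k_sum).unsqueeze(-1) + eps  # (B, N, 1)
    return torch.einsum("bnd,bde->bne", q, kv) / z                # (B, N, d)

# Usage on a hypothetical 64x64 latent flattened into 4096 tokens:
# q = k = v = torch.randn(1, 4096, 64); out = linear_attention(q, k, v)
```

Because the per-token cost does not grow with the total token count, such a layer keeps memory roughly linear in resolution, which is the property the paper exploits for zero-shot cross-resolution and 16K-scale generation.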
URL
https://arxiv.org/abs/2409.02097