Abstract
Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.
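The abstract attributes the 41% reduction in trainable parameters to a weight decomposition strategy. The paper's exact decomposition is not spelled out here, so as a hedged illustration only, the sketch below shows the arithmetic behind one common family of such strategies: factorizing a full weight matrix into two thin low-rank factors. The function name and dimensions are hypothetical, not taken from FlexEControl.

```python
# Hypothetical illustration (not the paper's actual method): a full weight
# matrix W of shape (d_out, d_in) is replaced by two thin factors
# A (d_out, r) and B (r, d_in) with r << min(d_out, d_in), which is one
# standard way to shrink the trainable-parameter budget when adapting a
# frozen diffusion backbone to extra conditioning inputs.

def decomposed_param_counts(d_out: int, d_in: int, rank: int):
    """Return (full, factored) trainable-parameter counts."""
    full = d_out * d_in                 # dense matrix: d_out * d_in params
    factored = rank * (d_out + d_in)    # two factors: r*d_out + r*d_in params
    return full, factored

full, factored = decomposed_param_counts(1024, 1024, 64)
print(full, factored)  # 1048576 131072 -> factored form is 12.5% of full
```

With these example dimensions the factored parameterization uses only 12.5% of the dense parameter count, illustrating how decomposition-based methods can yield large savings of the kind the abstract reports.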
URL
https://arxiv.org/abs/2405.04834