Abstract
Recent advancements in personalizing text-to-image (T2I) diffusion models have shown the capability to generate images of personalized visual concepts from a limited number of user-provided examples. However, these models often struggle to maintain high visual fidelity, particularly when manipulating scenes as defined by textual inputs. To address this, we introduce ComFusion, a novel approach that leverages pretrained models to generate compositions of a few user-provided subject images and predefined-text scenes, effectively fusing visual-subject instances with textual-specific scenes and producing high-fidelity instances within diverse scenes. ComFusion integrates a class-scene prior preservation regularization, which leverages composites of subject-class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse generated images, ensuring they align effectively with both the instance images and the scene texts. Consequently, ComFusion maintains a delicate balance between capturing the essence of the subject and preserving scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization demonstrate its qualitative and quantitative superiority.
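The abstract names two mechanisms, a class-scene prior preservation regularization and an alignment of coarse generated images with the instance images and scene texts, without giving their form. Below is a minimal, hypothetical PyTorch sketch of how such a composite objective might look, assuming a standard latent-diffusion denoising loss and a shared image-text embedding space; unet, image_encoder, text_encoder, and lambda_prior are illustrative placeholders inferred from the abstract, not the paper's actual API.

    import torch.nn.functional as F

    def class_scene_prior_loss(unet, noisy_latents, timesteps, text_emb,
                               prior_latents, prior_timesteps, prior_text_emb,
                               target_noise, prior_target_noise,
                               lambda_prior=1.0):
        # Instance term: standard denoising loss on the user-provided
        # subject images, conditioned on subject-in-scene prompts.
        instance_loss = F.mse_loss(
            unet(noisy_latents, timesteps, text_emb), target_noise)
        # Prior term (DreamBooth-style, extended to scenes): the same loss
        # on images the frozen pretrained model generated for hypothetical
        # "a <class> in <scene>" prompts, preserving its class- and
        # scene-specific knowledge during fine-tuning.
        prior_loss = F.mse_loss(
            unet(prior_latents, prior_timesteps, prior_text_emb),
            prior_target_noise)
        return instance_loss + lambda_prior * prior_loss

    def fusion_alignment_loss(image_encoder, text_encoder,
                              coarse_images, instance_images, scene_texts):
        # Hypothetical visual-textual matching term: pull coarse generations
        # toward the instance images (subject fidelity) and toward the scene
        # prompts (scene fidelity) in a shared embedding space, e.g. CLIP.
        img = F.normalize(image_encoder(coarse_images), dim=-1)
        ref = F.normalize(image_encoder(instance_images), dim=-1)
        txt = F.normalize(text_encoder(scene_texts), dim=-1)
        visual_term = 1.0 - (img * ref).sum(-1).mean()
        textual_term = 1.0 - (img * txt).sum(-1).mean()
        return visual_term + textual_term

In this reading, the prior term mirrors DreamBooth-style prior preservation but conditions on scene-specific prompts, which is consistent with the abstract's stated balance between capturing the subject's essence and maintaining scene fidelity.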
URL
https://arxiv.org/abs/2402.11849