Abstract
We propose a novel framework for ID-preserving generation that uses a multi-modal encoding strategy rather than injecting identity features into pre-trained models via adapters. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with the face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline, by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.
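To make the described conditioning pathway concrete, below is a minimal sketch in PyTorch. Every name, layer choice, and dimension here (FaceCLIPSketch, alignment_loss, joint_dim, the transformer fuser) is an illustrative assumption, not the paper's released implementation; in particular, the contrastive form of the alignment loss is just one plausible reading of "a loss that aligns its joint representation with face, text, and image embedding spaces."

```python
# Sketch of a FaceCLIP-style joint encoder and alignment loss.
# All names, shapes, and layer choices are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceCLIPSketch(nn.Module):
    """Fuses a face-ID embedding and text tokens into one joint
    conditioning sequence for a base diffusion model."""
    def __init__(self, id_dim=512, text_dim=768, joint_dim=2048):
        super().__init__()
        self.id_proj = nn.Linear(id_dim, joint_dim)      # project face-ID features
        self.text_proj = nn.Linear(text_dim, joint_dim)  # project text-token features
        self.fuser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(joint_dim, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, id_embed, text_tokens):
        # id_embed: (B, id_dim) from a face recognition backbone
        # text_tokens: (B, T, text_dim) from a text encoder
        id_tok = self.id_proj(id_embed).unsqueeze(1)     # (B, 1, joint_dim)
        txt_tok = self.text_proj(text_tokens)            # (B, T, joint_dim)
        joint = self.fuser(torch.cat([id_tok, txt_tok], dim=1))
        return joint  # unified identity+text conditioning sequence

def alignment_loss(joint, face_emb, text_emb, image_emb, tau=0.07):
    """Hypothetical multi-modal alignment objective: pull the pooled joint
    representation toward face, text, and image embeddings via symmetric
    InfoNCE terms. Targets are assumed pre-projected to joint_dim."""
    pooled = F.normalize(joint.mean(dim=1), dim=-1)      # (B, joint_dim)
    loss = 0.0
    for target in (face_emb, text_emb, image_emb):
        target = F.normalize(target, dim=-1)
        logits = pooled @ target.t() / tau               # (B, B) similarities
        labels = torch.arange(pooled.size(0), device=pooled.device)
        loss = loss + F.cross_entropy(logits, labels)    # match i-th pair
    return loss / 3.0

# Usage (shapes only):
# model = FaceCLIPSketch()
# cond = model(torch.randn(4, 512), torch.randn(4, 77, 768))  # (4, 78, 2048)
```

In a full pipeline along the lines of FaceCLIP-SDXL, the resulting joint token sequence would stand in for the text-only conditioning fed to SDXL's cross-attention layers, so identity and prompt semantics enter the denoiser through a single interface.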
URL
https://arxiv.org/abs/2504.14202