Abstract
Personalized portrait generation with text-to-image diffusion models has advanced significantly with Textual Inversion, which has emerged as a promising approach for creating high-fidelity personalized images. Despite this potential, current Textual Inversion methods struggle to maintain a consistent facial identity because the textual and visual embedding spaces are semantically misaligned with respect to identity. We introduce ID-EA, a novel framework that guides text embeddings to align with visual identity embeddings, thereby improving identity preservation in personalized generation. ID-EA comprises two key components: the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). First, the ID-Enhancer integrates identity embeddings with a textual ID anchor, refining visual identity embeddings derived from a face recognition model using representative text embeddings. Then, the ID-Adapter leverages the identity-enhanced embedding to adapt the text condition, ensuring identity preservation by adjusting the cross-attention module in the pre-trained UNet model. This process encourages the text features to find the most relevant visual cues across the foreground snippets. Extensive quantitative and qualitative evaluations demonstrate that ID-EA substantially outperforms state-of-the-art methods on identity preservation metrics while achieving remarkable computational efficiency, generating personalized portraits approximately 15 times faster than existing approaches.
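As a rough illustration of the two-stage idea described in the abstract, the following pure-Python sketch fuses a face embedding with a textual anchor and then feeds the result into a cross-attention step. It is not the authors' implementation: the abstract does not say how the ID-Enhancer or ID-Adapter are built, so the additive fusion, the placeholder-token swap, the toy dimensions, and all function names here are assumptions made purely for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights; each row sums to 1."""
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    return [softmax([s / scale for s in row]) for row in scores]

def cross_attention(queries, context, w_k, w_v):
    """Image-side queries attend to text-side context tokens."""
    keys = matmul(context, w_k)
    values = matmul(context, w_v)
    return matmul(attention_weights(queries, keys), values)

def enhance_identity(id_embedding, anchor_embedding, w_id):
    """Assumed enhancer: project the face embedding into text space, then add the anchor."""
    projected = matmul([id_embedding], w_id)[0]
    return [a + p for a, p in zip(anchor_embedding, projected)]

def adapt_text_condition(text_tokens, enhanced_id, placeholder_index):
    """Assumed adapter: swap the placeholder token for the enhanced embedding."""
    adapted = [list(tok) for tok in text_tokens]
    adapted[placeholder_index] = list(enhanced_id)
    return adapted

# Toy data: a 3-token prompt whose middle token is the identity placeholder.
text_tokens = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
anchor = [0.5, 0.5]                           # toy textual ID anchor
face_id = [1.0, 2.0, 3.0]                     # toy face recognition embedding
w_id = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]   # toy projection from 3 to 2 dimensions
identity = [[1.0, 0.0], [0.0, 1.0]]           # identity key/value projections
queries = [[1.0, 0.0], [0.0, 1.0]]            # toy image-side queries

enhanced = enhance_identity(face_id, anchor, w_id)
adapted = adapt_text_condition(text_tokens, enhanced, placeholder_index=1)
output = cross_attention(queries, adapted, identity, identity)
```

In this toy setup only the placeholder token changes, so any difference in `output` relative to the unadapted prompt comes entirely from the identity-enhanced embedding. A real system would use learned projections inside the UNet's cross-attention layers rather than fixed matrices.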
URL
https://arxiv.org/abs/2507.11990