Paper Reading AI Learner

ID-EA: Identity-driven Text Enhancement and Adaptation with Textual Inversion for Personalized Text-to-Image Generation

2025-07-16 07:42:02
Hyun-Jun Jin, Young-Eun Kim, Seong-Whan Lee

Abstract

Personalized portrait generation with text-to-image diffusion models has advanced significantly with Textual Inversion, which has emerged as a promising approach for creating high-fidelity personalized images. Despite this potential, current Textual Inversion methods struggle to maintain a consistent facial identity because the textual and visual embedding spaces are semantically misaligned with respect to identity. We introduce ID-EA, a novel framework that guides text embeddings to align with visual identity embeddings, thereby improving identity preservation in personalized generation. ID-EA comprises two key components: the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). First, the ID-Enhancer integrates identity embeddings with a textual ID anchor, refining visual identity embeddings derived from a face recognition model using representative text embeddings. Then, the ID-Adapter leverages the identity-enhanced embedding to adapt the text condition, ensuring identity preservation by adjusting the cross-attention module in the pre-trained UNet model. This process encourages the text features to find the most related visual clues across the foreground snippets. Extensive quantitative and qualitative evaluations demonstrate that ID-EA substantially outperforms state-of-the-art methods on identity preservation metrics while also being computationally efficient, generating personalized portraits approximately 15 times faster than existing approaches.
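
The two-stage idea in the abstract (fuse a face-recognition identity embedding with a textual anchor, then use the result to adapt the text condition seen by cross-attention) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names, the linear projection `proj`, the additive fusion, and the choice to append the identity token to the text tokens are all assumptions made for illustration, and the abstract does not specify any of these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Standard scaled dot-product attention: image features attend to condition tokens.
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights

def enhance_identity(id_emb, anchor_emb, proj):
    # Hypothetical stand-in for an "ID-Enhancer": project the face-recognition
    # embedding into the text embedding dimension, add a textual anchor
    # embedding, and L2-normalize. The real fusion is not given in the abstract.
    fused = id_emb @ proj + anchor_emb
    return fused / np.linalg.norm(fused)

def id_conditioned_attention(image_feats, text_tokens, id_token):
    # Hypothetical stand-in for an "ID-Adapter": adapt the text condition by
    # appending the identity-enhanced token, then run ordinary cross-attention.
    condition = np.concatenate([text_tokens, id_token[None, :]], axis=0)
    return cross_attention(image_feats, condition, condition)

rng = np.random.default_rng(0)
text_dim, face_dim = 8, 32                        # toy sizes, chosen arbitrarily
image_feats = rng.normal(size=(16, text_dim))     # 16 spatial locations
text_tokens = rng.normal(size=(5, text_dim))      # 5 prompt tokens
id_emb = rng.normal(size=(face_dim,))             # face-recognition embedding
anchor_emb = rng.normal(size=(text_dim,))         # textual ID anchor embedding
proj = rng.normal(size=(face_dim, text_dim))      # assumed linear projection

id_token = enhance_identity(id_emb, anchor_emb, proj)
out, weights = id_conditioned_attention(image_feats, text_tokens, id_token)
print(out.shape, weights.shape)
```

In this sketch each of the 16 image locations distributes its attention over six condition tokens (five prompt tokens plus the identity token), so the last column of `weights` shows how strongly each location draws on the identity signal.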

URL

https://arxiv.org/abs/2507.11990

PDF

https://arxiv.org/pdf/2507.11990.pdf

