Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization

2024-03-21 06:03:51
Yeji Song, Jimyeong Kim, Wonhark Park, Wonsik Shin, Wonjong Rhee, Nojun Kwak

Abstract

Amid a surge of text-to-image (T2I) models and customization methods that generate new images of a user-provided subject, current work focuses on alleviating the cost of lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding, which is then utilized alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new, transient context. However, the existing methods often 1) are significantly influenced by the input images, e.g., generating images with the same pose, and 2) exhibit deterioration of the subject's identity. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the textual embedding that carries the desired pose information. To address this issue, we propose an orthogonal visual embedding that effectively harmonizes with the given textual embedding. We also adopt a visual-only embedding and inject the subject's distinct features via a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while effectively maintaining the subject's identity.
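The two mechanisms named in the abstract can be illustrated concretely. The PyTorch sketch below is an assumption-laden illustration, not the authors' implementation: the function names, tensor shapes, and the exact projection are guesses from the abstract alone. It shows (1) projecting the visual embedding onto the orthogonal complement of the textual embedding, and (2) a hypothetical self-attention swap in which queries from the text-guided branch attend to keys and values taken from a visual-only branch.

```python
# Minimal sketch of the two ideas described in the abstract.
# All names and shapes here are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def orthogonal_visual_embedding(visual_emb: torch.Tensor,
                                text_emb: torch.Tensor) -> torch.Tensor:
    """Drop the component of the visual embedding that lies along the
    textual embedding, so redundant information in the visual embedding
    (e.g., the input pose) cannot fight the pose requested by the text.

    visual_emb, text_emb: (batch, dim) conditioning embeddings.
    """
    t = F.normalize(text_emb, dim=-1)                      # unit text direction
    parallel = (visual_emb * t).sum(-1, keepdim=True) * t  # projection onto text
    return visual_emb - parallel                           # orthogonal remainder

def self_attention_swap(q_main: torch.Tensor,
                        k_visual: torch.Tensor,
                        v_visual: torch.Tensor) -> torch.Tensor:
    """Hypothetical self-attention swap: queries from the text-guided
    denoising branch attend to keys/values from a visual-only branch,
    injecting the subject's appearance features.

    q_main, k_visual, v_visual: (batch, heads, tokens, head_dim).
    """
    scale = q_main.shape[-1] ** -0.5
    attn = torch.softmax(q_main @ k_visual.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_visual
```

In this reading, the projection preserves the subject's intrinsic appearance while ceding transient attributes such as pose to the text, and the swap lets the subject's features enter the generation without tying its layout to the input image.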


URL

https://arxiv.org/abs/2403.14155

PDF

https://arxiv.org/pdf/2403.14155.pdf
