Abstract
Amid a surge of text-to-image (T2I) models and customization methods that generate new images of a user-provided subject, current work focuses on alleviating the cost of lengthy per-subject optimization. These zero-shot customization methods encode an image of the specified subject into a visual embedding, which is then used alongside the textual embedding for diffusion guidance. The visual embedding carries intrinsic information about the subject, while the textual embedding provides a new, transient context. However, existing methods often 1) are strongly influenced by the input images, e.g., generating images with the same pose, and 2) exhibit deterioration of the subject's identity. We first pin down the problem, showing that redundant pose information in the visual embedding interferes with the textual embedding that carries the desired pose information. To address this issue, we propose an orthogonal visual embedding that harmonizes effectively with the given textual embedding. We also adopt a visual-only embedding and inject the subject's distinct features via a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while faithfully preserving the subject's identity.
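The abstract does not specify how the orthogonal visual embedding is computed, but the stated goal, removing from the visual embedding the component that interferes with the textual embedding, can be illustrated with a standard orthogonal projection. The sketch below is ours, not the paper's implementation; the function name and vector shapes are assumptions:

```python
import numpy as np

def orthogonalize(visual: np.ndarray, textual: np.ndarray) -> np.ndarray:
    """Illustrative sketch (not the paper's method): remove from `visual`
    the component lying along `textual`, so the result is orthogonal to
    the textual direction and cannot override it during guidance."""
    t = textual / np.linalg.norm(textual)   # unit vector along the textual embedding
    return visual - np.dot(visual, t) * t   # subtract the projection onto t

# Toy example with random embedding vectors.
rng = np.random.default_rng(0)
visual = rng.normal(size=8)
textual = rng.normal(size=8)
visual_orth = orthogonalize(visual, textual)
print(np.dot(visual_orth, textual))  # ≈ 0 up to floating-point error
```

In practice the embeddings would be the encoder outputs used for diffusion guidance; the projection ensures the visual embedding contributes only information independent of the textual direction.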
URL
https://arxiv.org/abs/2403.14155