Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive ID-preservation strategy that considers both intricate facial details and the face as a whole. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, using only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions, and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through a facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, although ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.
https://arxiv.org/abs/2404.16771
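The abstract does not spell out how the facial attention localization strategy is implemented. Below is a minimal sketch of one plausible formulation, assuming per-region segmentation masks and per-token cross-attention maps are available; the loss form and all names are illustrative, not the paper's code.

```python
# Hypothetical sketch of a facial attention localization loss: for each facial region
# (eyes, nose, mouth, ...), the cross-attention mass of the tokens describing it is
# pushed to lie inside that region's segmentation mask.
import torch

def attention_localization_loss(attn_maps: torch.Tensor,
                                region_masks: torch.Tensor,
                                token_to_region: torch.Tensor) -> torch.Tensor:
    """
    attn_maps:       [num_tokens, H, W]  cross-attention maps (non-negative).
    region_masks:    [num_regions, H, W] binary masks of facial regions.
    token_to_region: [num_tokens] index of the region each token describes.
    """
    losses = []
    for t in range(attn_maps.shape[0]):
        attn = attn_maps[t]
        mask = region_masks[token_to_region[t]]
        inside = (attn * mask).sum()
        total = attn.sum().clamp_min(1e-8)
        # maximize the fraction of attention that falls inside the region
        losses.append(1.0 - inside / total)
    return torch.stack(losses).mean()

# toy usage
attn = torch.rand(4, 64, 64)                       # 4 facial-feature tokens
masks = (torch.rand(3, 64, 64) > 0.7).float()      # 3 facial regions
tok2reg = torch.tensor([0, 0, 1, 2])
print(attention_localization_loss(attn, masks, tok2reg))
```

The intuition is simply that a token describing a facial region should spend its attention inside that region's mask, which is the kind of fine-grained control over facial areas the abstract argues is missing.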
The rapid evolution of text-to-image diffusion models has opened the door to generative AI, enabling the translation of textual descriptions into visually compelling images of remarkable quality. However, a persistent challenge within this domain is the optimization of prompts to effectively translate abstract concepts into concrete objects. For example, text encoders can hardly express "peace", while they can easily illustrate olive branches and white doves. This paper introduces a novel approach named Prompt Optimizer for Abstract Concepts (POAC), specifically designed to enhance the performance of text-to-image diffusion models in interpreting and generating images from abstract concepts. We propose a Prompt Language Model (PLM), which is initialized from a pre-trained language model and then fine-tuned on a curated dataset of abstract-concept prompts. The dataset is created with GPT-4, which expands each abstract concept into a scene and concrete objects. Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between images generated by a Stable Diffusion model and the optimized prompts. Through extensive experiments, we demonstrate that our proposed POAC significantly improves the accuracy and aesthetic quality of generated images, particularly in the depiction of abstract concepts and alignment with optimized prompts. We also present a comprehensive analysis of our model's performance across diffusion models under different settings, showcasing its versatility and effectiveness in enhancing abstract concept representation.
https://arxiv.org/abs/2404.11589
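The abstract describes RL-based optimization of the Prompt Language Model against an image-text alignment reward. A toy REINFORCE-style sketch of that loop follows, with a stand-in PLM, a dummy reward, and no actual diffusion model; everything here is hypothetical and only the update rule is the point.

```python
# Minimal REINFORCE-style update for a prompt language model (PLM): sample a rewritten
# prompt, score it with an alignment reward, and increase the log-probability of
# high-reward prompts. The PLM, reward, and vocabulary are toy stand-ins, not POAC's.
import torch, torch.nn as nn

vocab, hidden = 100, 32
plm = nn.Sequential(nn.Embedding(vocab, hidden), nn.GRU(hidden, hidden, batch_first=True))
head = nn.Linear(hidden, vocab)
opt = torch.optim.Adam(list(plm.parameters()) + list(head.parameters()), lr=1e-3)

def sample_prompt(prefix, steps=8):
    """Autoregressively sample token ids and keep their log-probs."""
    ids, logps, h = [prefix], [], None
    x = torch.tensor([[prefix]])
    for _ in range(steps):
        emb = plm[0](x)
        out, h = plm[1](emb, h)
        dist = torch.distributions.Categorical(logits=head(out[:, -1]))
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        ids.append(int(tok))
        x = tok.unsqueeze(0)
    return ids, torch.stack(logps).sum()

def alignment_reward(token_ids):
    # placeholder for a CLIP-style image-text alignment score of the generated image
    return float(len(set(token_ids))) / len(token_ids)

for step in range(5):
    ids, logp = sample_prompt(prefix=1)
    r = alignment_reward(ids)
    loss = -(r * logp)            # REINFORCE: reward-weighted negative log-likelihood
    opt.zero_grad(); loss.backward(); opt.step()
    print(step, r)
```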
Existing neural rendering-based text-to-3D-portrait generation methods typically make use of a human geometry prior and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a neural rendering-based framework with a novel joint geometry-appearance prior that achieves text-to-3D-portrait generation while overcoming the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360° canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating unrealistic colors and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.
https://arxiv.org/abs/2404.10394
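Score distillation sampling (SDS) is the standard mechanism the abstract refers to for distilling the diffusion model's knowledge into the pyramid tri-grid. A generic SDS step is sketched below, with a toy noise predictor and a learnable image standing in for the 3D representation; the weighting and noise schedule are common choices, not necessarily the paper's.

```python
# Generic score distillation sampling (SDS): forward-diffuse the current rendering,
# ask the (frozen) diffusion model to predict the noise, and use the residual between
# predicted and injected noise as a gradient on the rendering parameters.
import torch

def sds_grad(x0, eps_model, alphas_cumprod, t, guidance=lambda e: e):
    """Return the SDS gradient w.r.t. the rendered image x0."""
    a_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps        # forward-diffuse the rendering
    with torch.no_grad():
        eps_pred = guidance(eps_model(x_t, t))             # text-conditioned noise prediction
    w_t = 1 - a_t                                          # a common weighting choice
    return w_t * (eps_pred - eps)

# toy usage: a learnable image stands in for the differentiable 3D representation
x0 = torch.zeros(1, 3, 32, 32, requires_grad=True)
alphas_cumprod = torch.cumprod(torch.linspace(0.9999, 0.01, 1000), dim=0)
toy_eps_model = lambda x, t: 0.1 * x                        # stand-in for the diffusion UNet
opt = torch.optim.Adam([x0], lr=1e-2)

for _ in range(3):
    t = torch.randint(50, 950, (1,)).item()
    grad = sds_grad(x0, toy_eps_model, alphas_cumprod, t)
    loss = (grad.detach() * x0).sum()    # standard trick: d(loss)/d(x0) == grad
    opt.zero_grad(); loss.backward(); opt.step()
```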
To enhance the controllability of text-to-image diffusion models, existing efforts like ControlNet incorporated image-based conditional controls. In this paper, we reveal that existing methods still face significant challenges in generating images that align with the image conditional controls. To this end, we propose ControlNet++, a novel approach that improves controllable generation by explicitly optimizing pixel-level cycle consistency between generated images and conditional controls. Specifically, for an input conditional control, we use a pre-trained discriminative reward model to extract the corresponding condition of the generated images, and then optimize the consistency loss between the input conditional control and extracted condition. A straightforward implementation would be generating images from random noises and then calculating the consistency loss, but such an approach requires storing gradients for multiple sampling timesteps, leading to considerable time and memory costs. To address this, we introduce an efficient reward strategy that deliberately disturbs the input images by adding noise, and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive costs associated with image sampling, allowing for more efficient reward fine-tuning. Extensive experiments show that ControlNet++ significantly improves controllability under various conditional controls. For example, it achieves improvements over ControlNet by 7.9% mIoU, 13.4% SSIM, and 7.6% RMSE, respectively, for segmentation mask, line-art edge, and depth conditions.
https://arxiv.org/abs/2404.07987
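A hedged sketch of the efficient reward strategy as described: disturb the training image with noise, take the single-step denoised estimate, re-extract its condition with a frozen reward model, and penalize disagreement with the input control. The models below are stand-ins; only the data flow follows the abstract.

```python
# Single-step reward fine-tuning: no multi-step sampling, so no gradients need to be
# stored across many denoising timesteps.
import torch
import torch.nn.functional as F

def one_step_x0(x_t, eps_pred, a_t):
    """Single-step denoised estimate of x0 from the predicted noise."""
    return (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()

def reward_consistency_loss(image, control_mask, eps_model, reward_model, alphas_cumprod, t):
    a_t = alphas_cumprod[t]
    eps = torch.randn_like(image)
    x_t = a_t.sqrt() * image + (1 - a_t).sqrt() * eps       # deliberately disturb the input
    eps_pred = eps_model(x_t, t, control_mask)               # trainable controllable model
    x0_hat = one_step_x0(x_t, eps_pred, a_t)
    cond_hat = reward_model(x0_hat)                           # frozen condition extractor
    return F.binary_cross_entropy(cond_hat, control_mask)     # cycle-consistency loss

# toy usage with stand-in models
img = torch.rand(1, 3, 32, 32)
mask = (torch.rand(1, 1, 32, 32) > 0.5).float()
alphas_cumprod = torch.cumprod(torch.linspace(0.9999, 0.01, 1000), dim=0)
toy_model = lambda x, t, c: 0.1 * x                           # stand-in diffusion model
toy_reward = lambda x: torch.sigmoid(x.mean(dim=1, keepdim=True))  # stand-in segmenter
print(reward_consistency_loss(img, mask, toy_model, toy_reward, alphas_cumprod, 500))
```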
In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity-preservation and prompt faithfulness. Our work is open-source, thereby providing universal access to these advancements.
https://arxiv.org/abs/2404.05674
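The abstract mentions a self-attention shortcut that transfers reference-image features into the diffusion model. One plausible realization is appending reference keys and values to the self-attention of the image being generated; the sketch below is illustrative, not MoMA's implementation, and all dimensions and projections are made up.

```python
# Self-attention "shortcut": keys/values computed from reference-image features are
# concatenated to the generated image's own keys/values, so queries can attend to the
# reference subject and copy its appearance.
import torch
import torch.nn.functional as F

def self_attention_with_reference(x, ref, w_q, w_k, w_v, scale=None):
    """
    x:   [B, N, C]  features of the image being generated
    ref: [B, M, C]  features extracted from the reference image
    """
    q = x @ w_q
    k = torch.cat([x, ref], dim=1) @ w_k      # shortcut: reference joins the keys
    v = torch.cat([x, ref], dim=1) @ w_v      # ...and the values
    scale = scale or q.shape[-1] ** -0.5
    attn = F.softmax(q @ k.transpose(1, 2) * scale, dim=-1)
    return attn @ v

B, N, M, C = 1, 16, 8, 64
x, ref = torch.randn(B, N, C), torch.randn(B, M, C)
w_q, w_k, w_v = (torch.randn(C, C) * 0.02 for _ in range(3))
out = self_attention_with_reference(x, ref, w_q, w_k, w_v)
print(out.shape)   # torch.Size([1, 16, 64])
```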
Recent strides in the development of diffusion models, exemplified by advancements such as Stable Diffusion, have underscored their remarkable prowess in generating visually compelling images. However, the imperative of achieving a seamless alignment between the generated image and the provided prompt persists as a formidable challenge. This paper traces the root of these difficulties to invalid initial noise, and proposes a solution in the form of Initial Noise Optimization (InitNO), a paradigm that refines this noise. Considering text prompts, not all random noises are effective in synthesizing semantically-faithful images. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise, bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at this https URL.
https://arxiv.org/abs/2404.04650
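A rough sketch of the noise-scoring idea: a cross-attention response score that is high when some subject token receives too little attention, a self-attention conflict score that grows when two subjects' attention maps overlap, and gradient descent on the initial noise to reduce both. The score definitions and the toy attention extractor below are assumptions, not InitNO's exact formulas.

```python
# Simplified noise optimization loop: score the initial noise via attention statistics,
# then nudge the noise toward the "valid" region by gradient descent.
import torch

def response_score(cross_attn):                  # [num_subject_tokens, H, W]
    peak = cross_attn.flatten(1).max(dim=1).values
    return (1.0 - peak).max()                    # worst-served subject token

def conflict_score(self_attn_a, self_attn_b):    # [H, W] maps per subject
    return torch.minimum(self_attn_a, self_attn_b).sum()

def toy_attention(noise):
    """Stand-in for running the UNet once and collecting attention maps."""
    maps = torch.sigmoid(noise[:2].mean(dim=0, keepdim=True) + torch.randn(2, 16, 16) * 0.01)
    return maps, maps[0], maps[1]

noise = torch.randn(4, 16, 16, requires_grad=True)
opt = torch.optim.Adam([noise], lr=0.05)
for step in range(10):
    cross, sa_a, sa_b = toy_attention(noise)
    loss = response_score(cross) + conflict_score(sa_a, sa_b)
    opt.zero_grad(); loss.backward(); opt.step()
print("final score:", float(loss))
```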
While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. In this work, we introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time. Specifically, the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts, and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore, the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects.
https://arxiv.org/abs/2404.03913
In recent years, the emergence of models capable of generating images from text has attracted considerable interest, offering the possibility of creating realistic images from text descriptions. Yet these advances have also raised concerns about the potential misuse of these images, including the creation of misleading content such as fake news and propaganda. This study investigates the effectiveness of using advanced vision-language models (VLMs) for synthetic image identification. Specifically, the focus is on tuning state-of-the-art image captioning models for synthetic image detection. By harnessing the robust understanding capabilities of large VLMs, the aim is to distinguish authentic images from synthetic images produced by diffusion-based models. This study contributes to the advancement of synthetic image detection by exploiting the capabilities of visual language models such as BLIP-2 and ViTGPT2. By tailoring image captioning models, we address the challenges associated with the potential misuse of synthetic images in real-world applications. Results described in this paper highlight the promising role of VLMs in the field of synthetic image detection, outperforming conventional image-based detection techniques. Code and models can be found at this https URL.
https://arxiv.org/abs/2404.02726
Two years ago, Stable Diffusion achieved super-human performance at generating images with super-human numbers of fingers. Following the steady decline of its technical novelty, we propose Stale Diffusion, a method that solidifies and ossifies Stable Diffusion in a maximum-entropy state. Stable Diffusion works analogously to a barn (the Stable) from which an infinite set of horses have escaped (the Diffusion). As the horses have long left the barn, our proposal may be seen as antiquated and irrelevant. Nevertheless, we vigorously defend our claim of novelty by identifying as early adopters of the Slow Science Movement, which will produce extremely important pearls of wisdom in the future. Our speed of contributions can also be seen as a quasi-static implementation of the recent call to pause AI experiments, which we wholeheartedly support. As a result of a careful archaeological expedition to 18-months-old Git commit histories, we found that naturally-accumulating errors have produced a novel entropy-maximising Stale Diffusion method, that can produce sleep-inducing hyper-realistic 5D video that is as good as one's imagination.
https://arxiv.org/abs/2404.01079
Recent advances in diffusion models have significantly improved text-to-image generation. However, generating videos from text is a more challenging task than generating images from text, due to the much larger dataset and higher computational cost required. Most existing video generation methods use either a 3D U-Net architecture that models the temporal dimension or autoregressive generation. These methods require large datasets and incur high computational costs compared to text-to-image generation. To tackle these challenges, we propose a simple yet effective novel grid diffusion for text-to-video generation that requires neither a temporal dimension in the architecture nor a large text-video paired dataset. We can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames by representing the video as a grid image. Additionally, since our method reduces the dimensions of the video to the dimensions of the image, various image-based methods can be applied to videos, such as text-guided video manipulation adapted from image manipulation. Our proposed method outperforms existing methods in both quantitative and qualitative evaluations, demonstrating its suitability for real-world video generation.
https://arxiv.org/abs/2404.00234
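The fixed-memory argument follows directly from the grid representation: however many frames there are, the diffusion model only ever sees one fixed-size image. A minimal sketch of the frame-grid conversion (the 4x4 layout is an arbitrary choice for illustration):

```python
# Lay out T frames as one grid image so an ordinary 2D image diffusion model can
# generate the whole clip, then split the grid back into frames.
import numpy as np

def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """frames: [T, H, W, C] -> grid image [rows*H, cols*W, C]"""
    t, h, w, c = frames.shape
    assert t == rows * cols
    return frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)

def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    return grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4).reshape(rows * cols, h, w, c)

video = np.random.rand(16, 64, 64, 3)          # 16 frames
grid = frames_to_grid(video, rows=4, cols=4)   # one 256x256 image
assert np.allclose(grid_to_frames(grid, 4, 4), video)
```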
Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often struggle with accurately depicting visual relationships between objects. This issue, as we uncover through careful analysis, arises from a misaligned text encoder that struggles to interpret specific relationships and differentiate the logical order of associated objects. To resolve this, we introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this, we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). It models the directional relationships between relation terms and corresponding objects within the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space. Crucially, our method retains the parameters of the text encoder and diffusion model, preserving the model's robust performance on unrelated descriptions. We validated our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative enhancements in generating images with precise visual relations. Project page: this https URL.
https://arxiv.org/abs/2403.20249
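A toy illustration of the direction-aware idea: a tiny graph layer with separate weights for forward and reverse edges over (subject, relation, object) nodes, whose output is added to the text embeddings as a correction. This is a hand-rolled stand-in, not the paper's HGCN architecture, and the scaling of the correction is arbitrary.

```python
# Direction-aware message passing: separate weight matrices for messages along and
# against the edge direction make "cat on box" and "box on cat" distinguishable.
import torch
import torch.nn as nn

class TinyDirectedGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_fwd = nn.Linear(dim, dim, bias=False)   # messages along the edge direction
        self.w_rev = nn.Linear(dim, dim, bias=False)   # messages against the edge direction
        self.w_self = nn.Linear(dim, dim, bias=False)

    def forward(self, nodes, edges):
        """nodes: [N, dim]; edges: list of (src, dst) index pairs following the relation."""
        agg = self.w_self(nodes)
        for s, d in edges:
            agg = agg.index_add(0, torch.tensor([d]), self.w_fwd(nodes[s]).unsqueeze(0))
            agg = agg.index_add(0, torch.tensor([s]), self.w_rev(nodes[d]).unsqueeze(0))
        return torch.tanh(agg)

dim = 32
layer = TinyDirectedGraphLayer(dim)
tokens = torch.randn(3, dim)             # embeddings of, e.g., "cat", "on", "box"
edges = [(0, 1), (1, 2)]                 # cat -> on -> box; reversing object order flips edges
delta = layer(tokens, edges)
rectified = tokens + 0.1 * delta         # correction injected into the text embeddings
print(rectified.shape)
```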
Over the past few years, Text-to-Image (T2I) generation approaches based on diffusion models have gained significant attention. However, vanilla diffusion models often suffer from spelling inaccuracies in the text displayed within the generated images. The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications. To produce accurate visual text images, state-of-the-art techniques adopt a glyph-controlled image generation approach, consisting of a text layout generator followed by an image generator that is conditioned on the generated text layout. Nevertheless, our study reveals that these models still face three primary challenges, prompting us to develop a testbed to facilitate future research. We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text. Subsequently, we introduce a training-free framework to enhance such two-stage generation approaches. We examine the effectiveness of our approach on both the LenCom-Eval and MARIO-Eval benchmarks and demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores. For instance, our proposed framework improves the backbone model, TextDiffuser, by more than 23% and 13.5% in terms of OCR word F1 on LenCom-Eval and MARIO-Eval, respectively. Our work makes a unique contribution to the field by focusing on generating images with long and rare text sequences, a niche previously unexplored by existing literature.
https://arxiv.org/abs/2403.16422
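The reported metrics are word-level OCR scores of the rendered text against the intended text. A simplified version of how such precision/recall/F1 and a normalized edit distance can be computed is sketched below; this is my own implementation for illustration, not the official evaluation code.

```python
# Word-level OCR scoring of generated visual text: multiset word matches for
# precision/recall/F1 plus a normalized Levenshtein distance over the full string.
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ocr_word_scores(ocr_words, target_words):
    ocr = Counter(w.lower() for w in ocr_words)
    tgt = Counter(w.lower() for w in target_words)
    matched = sum((ocr & tgt).values())
    precision = matched / max(sum(ocr.values()), 1)
    recall = matched / max(sum(tgt.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    ned = levenshtein(" ".join(ocr_words).lower(), " ".join(target_words).lower())
    ned /= max(len(" ".join(target_words)), 1)
    return {"precision": precision, "recall": recall, "f1": f1, "norm_edit_dist": ned}

print(ocr_word_scores(["Happy", "Birthdy", "Alice"], ["Happy", "Birthday", "Alice"]))
```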
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The text input is restricted to 77 tokens, and an empirical study shows the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications for image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses its zero-shot generalizability, and aligns with the CLIP latent space, making it a drop-in replacement for CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model supporting longer contexts necessitates pretraining with vast amounts of data, incurring significant expenses. Accordingly, Long-CLIP introduces an efficient fine-tuning solution for CLIP with two novel strategies designed to maintain the original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
https://arxiv.org/abs/2403.15378
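A sketch of what "knowledge-preserved stretching of positional embedding" can look like: keep the first, well-trained positions untouched and interpolate only the tail out to the longer context. The split point of 20 (motivated by the reported effective length) and the linear interpolation mode are assumptions, not necessarily the paper's exact recipe.

```python
# Stretch CLIP's 77 positional embeddings to a longer context while leaving the
# well-trained leading positions unchanged.
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor, new_len: int, keep: int = 20):
    """pos_emb: [old_len, dim] -> [new_len, dim]"""
    kept, rest = pos_emb[:keep], pos_emb[keep:]
    target = new_len - keep
    # interpolate along the position axis: [1, dim, old_rest] -> [1, dim, target]
    rest = F.interpolate(rest.T.unsqueeze(0), size=target, mode="linear", align_corners=True)
    return torch.cat([kept, rest.squeeze(0).T], dim=0)

clip_pos = torch.randn(77, 512)                     # CLIP's original 77 positions
long_pos = stretch_positional_embedding(clip_pos, new_len=248)
print(long_pos.shape)                               # torch.Size([248, 512])
```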
Deep learning-based image generation has seen significant advancements with diffusion models, notably improving the quality of generated images. Despite these developments, generating images with unseen characteristics beneficial for downstream tasks has received limited attention. To bridge this gap, we propose Style-Extracting Diffusion Models, featuring two conditioning mechanisms. Specifically, we utilize 1) a style conditioning mechanism which allows style information from previously unseen images to be injected during image generation and 2) a content conditioning mechanism which can be targeted to a downstream task, e.g., layout for segmentation. We introduce a trainable style encoder to extract style information from images, and an aggregation block that merges style information from multiple style inputs. This architecture enables the generation of images with unseen styles in a zero-shot manner by leveraging styles from unseen images, resulting in more diverse generations. In this work, we use the image layout as the target condition and first show the capability of our method on a natural image dataset as a proof-of-concept. We further demonstrate its versatility in histopathology, where we combine prior knowledge about tissue composition and unannotated data to create diverse synthetic images with known layouts. This allows us to generate additional synthetic data to train a segmentation network in a semi-supervised fashion. We verify the added value of the generated images by showing improved segmentation results and lower performance variability between patients when synthetic images are included during segmentation training. Our code will be made publicly available at [LINK].
https://arxiv.org/abs/2403.14429
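The abstract mentions an aggregation block that merges style information from multiple style inputs. One simple realization is attention pooling with a learned query; the module below is purely illustrative, since the paper's style encoder and aggregation architecture are not specified at this level of detail.

```python
# Attention-pooling aggregation: a learned query attends over the style embeddings of
# several style images and produces one fused style vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleAggregator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, style_embs):                 # [B, num_styles, dim]
        k = self.proj(style_embs)
        attn = F.softmax(self.query @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return (attn @ style_embs).squeeze(1)      # [B, dim] fused style vector

agg = StyleAggregator(dim=128)
styles = torch.randn(2, 5, 128)                    # 5 style images per sample
print(agg(styles).shape)                           # torch.Size([2, 128])
```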
Amid a surge of text-to-image (T2I) models and customization methods that generate new images of a user-provided subject, current works focus on alleviating the costs incurred by lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new, transient context. However, the existing methods often 1) are significantly affected by the input images, e.g., generating images with the same pose, and 2) exhibit deterioration in the subject's identity. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the textual embedding containing the desired pose information. To address this issue, we propose an orthogonal visual embedding which effectively harmonizes with the given textual embedding. We also adopt the visual-only embedding and inject the subject's clear features utilizing a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while effectively maintaining the subject's identity.
https://arxiv.org/abs/2403.14155
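A natural reading of the "orthogonal visual embedding" is removing from the visual embedding its component along the textual embedding, so the text keeps control over transient attributes such as pose. A hedged sketch of that projection follows; it is not necessarily the paper's exact operator.

```python
# Project the subject's visual embedding onto the subspace orthogonal to the textual
# embedding, removing the redundant (e.g., pose) component that would otherwise
# interfere with the prompt.
import torch

def orthogonalize(visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
    """Remove from `visual` its component along `textual` (last dimension)."""
    t = textual / textual.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    coeff = (visual * t).sum(dim=-1, keepdim=True)
    return visual - coeff * t

v = torch.randn(4, 768)      # visual embeddings of the subject
t = torch.randn(4, 768)      # textual embeddings carrying the desired pose/context
v_orth = orthogonalize(v, t)
print(torch.allclose((v_orth * t).sum(-1), torch.zeros(4), atol=1e-3))   # ~orthogonal
```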
Virtual Try-on (VTON) involves generating images of a person wearing selected garments. Diffusion-based methods, in particular, can create high-quality images, but they struggle to maintain the identities of the input garments. We identified that this problem stems from the specifics of the training formulation for diffusion. To address this, we propose a unique training scheme that limits the scope in which diffusion is trained. We use a control image that perfectly aligns with the target image during training. In turn, this accurately preserves garment details during inference. We demonstrate that our method not only effectively conserves garment details but also allows for layering, styling, and shoe try-on. Our method runs multi-garment try-on in a single inference cycle and can support high-quality zoomed-in generations without training at higher resolutions. Finally, we show our method surpasses prior methods in accuracy and quality.
https://arxiv.org/abs/2403.13951
Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text inputs, such as incorrect spatial relationship or missing objects. In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging. First, SELMA leverages an LLM's in-context learning capability to generate multiple datasets of text prompts that can teach different skills, and then generates the images with a T2I model based on the prompts. Next, SELMA adapts the T2I model to the new skills by learning multiple single-skill LoRA (low-rank adaptation) experts followed by expert merging. Our independent expert fine-tuning specializes multiple models for different skills, and expert merging helps build a joint multi-skill T2I model that can generate faithful images given diverse text prompts, while mitigating the knowledge conflict from different datasets. We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (PickScore, ImageReward, and HPS), as well as human evaluation. Moreover, fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data. Lastly, we show that fine-tuning with images from a weaker T2I model can help improve the generation quality of a stronger T2I model, suggesting promising weak-to-strong generalization in T2I models.
https://arxiv.org/abs/2403.06952
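Expert merging over LoRA adapters reduces, for each adapted weight matrix, to combining the experts' low-rank updates. The sketch below simply averages the updates; SELMA's actual merging rule may differ, so this only shows the mechanics.

```python
# Merge skill-specific LoRA experts for one base weight matrix: each expert contributes
# a low-rank update (alpha/rank) * B @ A, and the merged model applies their weighted sum.
import torch

def lora_delta(A: torch.Tensor, B: torch.Tensor, alpha: float, rank: int) -> torch.Tensor:
    """A: [rank, in_dim], B: [out_dim, rank] -> full update [out_dim, in_dim]."""
    return (alpha / rank) * (B @ A)

def merge_lora_experts(W0: torch.Tensor, experts, weights=None) -> torch.Tensor:
    """experts: list of (A, B, alpha, rank) tuples trained on the same base weight W0."""
    weights = weights or [1.0 / len(experts)] * len(experts)
    delta = sum(w * lora_delta(A, B, alpha, r) for w, (A, B, alpha, r) in zip(weights, experts))
    return W0 + delta

out_dim, in_dim, rank = 64, 64, 4
W0 = torch.randn(out_dim, in_dim)
experts = [(torch.randn(rank, in_dim) * 0.01, torch.randn(out_dim, rank) * 0.01, 8.0, rank)
           for _ in range(3)]                      # e.g., counting / spatial / text-faithfulness skills
W_merged = merge_lora_experts(W0, experts)
print(W_merged.shape)
```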
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. To further improve T2I models' capability in numerical and spatial reasoning, the layout is employed as an intermedium to bridge large language models and layout-based diffusion models. However, these methods still struggle with generating images from textual prompts containing multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks. Our approach divides the layout prediction stage into numerical and spatial reasoning and bounding box prediction. Then, the layout-to-image generation stage is conducted in an iterative manner to reconstruct objects from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K benchmarks and our approach outperforms previous state-of-the-art models by notable margins. In addition, visual results demonstrate that our approach significantly improves the controllability and consistency in generating multiple objects from complex textual prompts.
https://arxiv.org/abs/2403.06400
The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper, we introduce MACE, a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast, MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning, collectively eliminating the information of undesirable concepts. Furthermore, MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure, celebrity erasure, explicit content erasure, and artistic style erasure. Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at this https URL.
https://arxiv.org/abs/2403.06135
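The "closed-form cross-attention refinement" belongs to a family of concept-editing updates with a ridge-regression-style solution: choose a new K/V projection that maps embeddings mentioning the unwanted concept to chosen targets while staying close to the original mapping on preserved prompts. Below is a generic, heavily hedged sketch of that closed form; it is not MACE's exact objective, targets, or regularization.

```python
# Generic closed-form projection refinement: least squares over "edit" embeddings,
# plus a preservation term over unrelated embeddings, solved in one matrix inverse.
import torch

def closed_form_refine(W, E_edit, V_target, E_keep, lam=0.5):
    """
    W:        [out_dim, emb_dim]  original cross-attention K or V projection
    E_edit:   [n_edit, emb_dim]   text embeddings whose mapping should change
    V_target: [n_edit, out_dim]   desired outputs for those embeddings
    E_keep:   [n_keep, emb_dim]   embeddings whose mapping should be preserved
    Minimizes sum ||W' e_edit - v||^2 + lam * sum ||W' e_keep - W e_keep||^2
    (plus a tiny ridge term for invertibility).
    """
    emb_dim = W.shape[1]
    A = E_edit.T @ E_edit + lam * E_keep.T @ E_keep + 1e-4 * torch.eye(emb_dim)
    B = V_target.T @ E_edit + lam * W @ (E_keep.T @ E_keep)
    return B @ torch.linalg.inv(A)

out_dim, emb_dim = 320, 768
W = torch.randn(out_dim, emb_dim) * 0.02
E_edit = torch.randn(5, emb_dim)              # prompts that mention the unwanted concept
V_target = torch.zeros(5, out_dim)            # e.g., push them toward a neutral response
E_keep = torch.randn(200, emb_dim)            # unrelated prompts whose behavior must be kept
W_new = closed_form_refine(W, E_edit, V_target, E_keep)
print(W_new.shape)
```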
Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the process by which the encoder produces the text representation is unknown. We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the Diffusion Lens, we perform an extensive analysis of two recent T2I models. Exploring compound prompts, we find that complex scenes describing multiple objects are composed progressively and more slowly compared to simple scenes; Exploring knowledge retrieval, we find that representation of uncommon concepts requires further computation compared to common concepts, and that knowledge retrieval is gradual across layers. Overall, our findings provide valuable insights into the text encoder component in T2I pipelines.
https://arxiv.org/abs/2403.05846
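The probing mechanism is simple to reproduce with off-the-shelf components: collect the text encoder's intermediate hidden states, normalize them with the encoder's own final layer norm, and condition the diffusion model on them instead of the last layer. A sketch with Hugging Face CLIP follows, assuming a recent transformers version; the layer index and model name are just examples, not the paper's configuration.

```python
# Extract an intermediate text representation from the CLIP text encoder used by
# Stable Diffusion v1.x and shape it like the usual conditioning tensor.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_name = "openai/clip-vit-large-patch14"          # SD v1.x text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_encoder = CLIPTextModel.from_pretrained(model_name).eval()

prompt = "a hedgehog reading a newspaper in a cafe"
inputs = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    out = text_encoder(**inputs, output_hidden_states=True)
    k = 6                                              # intermediate layer to probe
    intermediate = out.hidden_states[k]                # hidden_states[0] is the embedding layer
    # re-use the encoder's own final layer norm so the scale matches what the UNet expects
    prompt_embeds = text_encoder.text_model.final_layer_norm(intermediate)

print(prompt_embeds.shape)   # [1, 77, 768]; can be passed as prompt_embeds to a diffusion pipeline
```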