The influence of textures on machine learning models has been an ongoing investigation, specifically in texture bias/learning, interpretability, and robustness. However, due to the lack of large and diverse texture data available, the findings in these works have been limited, as more comprehensive evaluations have not been feasible. Image generative models are able to provide data creation at scale, but utilizing these models for texture synthesis has been unexplored and poses additional challenges both in creating accurate texture images and validating those images. In this work, we introduce an extensible methodology and corresponding new dataset for generating high-quality, diverse texture images capable of supporting a broad set of texture-based tasks. Our pipeline consists of: (1) developing prompts from a range of descriptors to serve as input to text-to-image models, (2) adopting and adapting Stable Diffusion pipelines to generate and filter the corresponding images, and (3) further filtering down to the highest quality images. Through this, we create the Prompted Textures Dataset (PTD), a dataset of 362,880 texture images that span 56 textures. During the process of generating images, we find that NSFW safety filters in image generation pipelines are highly sensitive to texture (and flag up to 60\% of our texture images), uncovering a potential bias in these models and presenting unique challenges when working with texture data. Through both standard metrics and a human evaluation, we find that our dataset is high quality and diverse.
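As a rough, hedged sketch of steps (1) and (2) above using Hugging Face diffusers (the checkpoint, texture/descriptor lists, and prompt template here are illustrative assumptions, not the paper's actual ones), the descriptor-driven generation loop and the safety-filter flagging the authors report might look like this:

```python
# Hedged sketch of a descriptor-driven texture generation loop, not the paper's pipeline.
# Checkpoint, texture/descriptor lists, and prompt template are assumptions.
import itertools
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # assumed checkpoint; ships with a safety checker
    torch_dtype=torch.float16,
).to("cuda")

textures = ["woven fabric", "marble"]      # hypothetical subset of the 56 textures
descriptors = ["rough", "glossy"]          # hypothetical descriptor vocabulary

kept, flagged = [], 0
for tex, desc in itertools.product(textures, descriptors):
    prompt = f"a close-up photograph of a {desc} {tex} texture, seamless, highly detailed"
    out = pipe(prompt, num_inference_steps=30)
    if out.nsfw_content_detected and out.nsfw_content_detected[0]:
        flagged += 1                       # the sensitivity the paper reports
        continue
    kept.append((prompt, out.images[0]))

print(f"kept {len(kept)} images; safety filter flagged {flagged}")
```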
https://arxiv.org/abs/2409.10297
In recent years, diffusion models have revolutionized visual generation, outperforming traditional frameworks like Generative Adversarial Networks (GANs). However, generating images of humans with realistic semantic parts, such as hands and faces, remains a significant challenge due to their intricate structural complexity. To address this issue, we propose a novel post-processing solution named RealisHuman. The RealisHuman framework operates in two stages. First, it generates realistic human parts, such as hands or faces, using the original malformed parts as references, ensuring consistent details with the original image. Second, it seamlessly integrates the rectified human parts back into their corresponding positions by repainting the surrounding areas to ensure smooth and realistic blending. The RealisHuman framework significantly enhances the realism of human generation, as demonstrated by notable improvements in both qualitative and quantitative metrics. Code is available at this https URL.
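The abstract does not give implementation details for the second stage; as a loose analogue of its "repaint the surrounding areas" step (not RealisHuman itself), a generic diffusion inpainting pass over an assumed seam mask could look like:

```python
# Loose analogue of "repaint the surrounding areas", not RealisHuman itself.
# The inpainting checkpoint, file paths, and seam-mask geometry are assumptions.
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

# Image with the rectified hand already pasted back at a known bounding box.
image = Image.open("person_with_pasted_hand.png").convert("RGB").resize((512, 512))
x0, y0, x1, y1 = 200, 300, 300, 400        # assumed bbox of the pasted part

# Seam mask: repaint only a band around the pasted region, keeping the part itself.
mask = Image.new("L", image.size, 0)
draw = ImageDraw.Draw(mask)
draw.rectangle([x0 - 16, y0 - 16, x1 + 16, y1 + 16], fill=255)
draw.rectangle([x0 + 16, y0 + 16, x1 - 16, y1 - 16], fill=0)

blended = pipe(prompt="a realistic human hand, natural skin",
               image=image, mask_image=mask).images[0]
blended.save("blended.png")
```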
https://arxiv.org/abs/2409.03644
We explore Bird's-Eye View (BEV) generation, converting a BEV map into its corresponding multi-view street images. Valued for its unified spatial representation aiding multi-sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street-view images from BEV maps is essential for portraying complex traffic scenarios and enhancing driving algorithms. Concurrently, diffusion-based conditional image generation models have demonstrated remarkable outcomes, adept at producing diverse, high-quality, and condition-aligned results. Nonetheless, the training of these models demands substantial data and computational resources. Hence, exploring methods to fine-tune these advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: the Neural View Transformation and the Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi-view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. Subsequently, the Street Image Generation phase utilizes these segmentations as a condition to guide a fine-tuned latent diffusion model. This finetuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition-coherent street view images.
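The paper fine-tunes its own segmentation-conditioned latent diffusion model; as a rough off-the-shelf analogue of the Street Image Generation stage (the checkpoints and input file below are assumptions, not the paper's model), a segmentation-conditioned ControlNet pipeline looks like this:

```python
# Rough off-the-shelf analogue of segmentation-conditioned street-view generation.
# The checkpoints and input file are assumptions, not the paper's fine-tuned model.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# One per-camera semantic map, as produced by the Neural View Transformation stage
# (loaded from disk here as a stand-in).
seg_map = Image.open("front_cam_semantic_map.png").convert("RGB").resize((512, 512))

street_view = pipe(
    "a city street scene, daytime, photorealistic",
    image=seg_map,
    num_inference_steps=30,
).images[0]
street_view.save("front_cam_render.png")
```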
https://arxiv.org/abs/2409.01014
Text-to-image (T2I) diffusion models have demonstrated impressive capabilities in generating high-quality images given a text prompt. However, ensuring the prompt-image alignment remains a considerable challenge, i.e., generating images that faithfully align with the prompt's semantics. Recent works attempt to improve the faithfulness by optimizing the latent code, which potentially could cause the latent code to go out-of-distribution and thus produce unrealistic images. In this paper, we propose FRAP, a simple, yet effective approach based on adaptively adjusting the per-token prompt weights to improve prompt-image alignment and authenticity of the generated images. We design an online algorithm to adaptively update each token's weight coefficient, which is achieved by minimizing a unified objective function that encourages object presence and the binding of object-modifier pairs. Through extensive evaluations, we show FRAP generates images with significantly higher prompt-image alignment to prompts from complex datasets, while having a lower average latency compared to recent latent code optimization methods, e.g., 4 seconds faster than D&B on the COCO-Subject dataset. Furthermore, through visual comparisons and evaluation on the CLIP-IQA-Real metric, we show that FRAP not only improves prompt-image alignment but also generates more authentic images with realistic appearances. We also explore combining FRAP with prompt rewriting LLM to recover their degraded prompt-image alignment, where we observe improvements in both prompt-image alignment and image quality.
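FRAP's actual objective and update rule are not spelled out in the abstract; the toy sketch below only illustrates the general idea of optimizing per-token weights against a presence-plus-binding objective, with the cross-attention maps stubbed out by random tensors (in practice they would come from the UNet's cross-attention layers during denoising):

```python
# Toy illustration of per-token prompt reweighting; NOT FRAP's objective or code.
# Cross-attention maps are random stand-ins; token indices and loss weights are assumed.
import torch

num_tokens, spatial = 8, 64 * 64
object_idx, modifier_idx = 3, 2            # assumed positions of e.g. "car" and "red"

weights = torch.ones(num_tokens, requires_grad=True)
opt = torch.optim.Adam([weights], lr=0.05)

for step in range(10):
    attn = torch.rand(spatial, num_tokens)                 # stand-in attention maps
    attn = attn * weights                                  # per-token reweighting
    attn = attn / attn.sum(dim=-1, keepdim=True)

    presence = attn[:, object_idx].max()                   # encourage object presence
    binding = -torch.norm(attn[:, object_idx] - attn[:, modifier_idx])  # tie modifier to object
    loss = -(presence + 0.5 * binding)                     # unified objective (form assumed)

    opt.zero_grad()
    loss.backward()
    opt.step()

print(weights.detach())
```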
https://arxiv.org/abs/2408.11706
Portrait Fidelity Generation is a prominent research area in generative models, with a primary focus on enhancing both controllability and fidelity. Current methods face challenges in generating high-fidelity portrait results when faces occupy a small portion of the image at low resolution, especially in multi-person group photo settings. To tackle these issues, we propose a systematic solution called MagicID, based on a self-constructed million-level multi-modal dataset named IDZoom. MagicID consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM Inversion based ID Restoration inference framework (DIIR). During training, MMF iteratively uses the skeleton and landmark modalities from IDZoom as conditional guidance. By introducing Clone Face Tuning in the training stage and Mask Guided Multi-ID Cross Attention (MGMICA) in the inference stage, explicit constraints on face positional features are achieved for multi-ID group photo generation. DIIR aims to address the issue of artifacts. DDIM Inversion is used in conjunction with face landmarks and global and local face features to achieve face restoration while keeping the background unchanged. Additionally, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method. To validate the effectiveness of MagicID, we conducted extensive comparative and ablation experiments. The experimental results demonstrate that MagicID has significant advantages in both subjective and objective metrics, and achieves controllable generation in multi-person scenarios.
https://arxiv.org/abs/2408.09248
This paper introduces Multi-Garment Customized Model Generation, a unified framework based on Latent Diffusion Models (LDMs) aimed at addressing the unexplored task of synthesizing images with free combinations of multiple pieces of clothing. The method focuses on generating customized models wearing various targeted outfits according to different text prompts. The primary challenge lies in maintaining the natural appearance of the dressed model while preserving the complex textures of each piece of clothing, ensuring that the information from different garments does not interfere with each other. To tackle these challenges, we first developed a garment encoder, a trainable UNet copy with shared weights, capable of extracting detailed garment features in parallel. Secondly, our framework supports the conditional generation of multiple garments through decoupled multi-garment feature fusion, allowing multiple clothing features to be injected into the backbone network and significantly alleviating conflicts between the information from different garments. Additionally, the proposed garment encoder is a plug-and-play module that can be combined with other extension modules such as IP-Adapter and ControlNet, enhancing the diversity and controllability of the generated models. Extensive experiments demonstrate the superiority of our approach over existing alternatives, opening up new avenues for the task of generating images with multiple-piece clothing combinations.
https://arxiv.org/abs/2408.05206
In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. Simultaneously generating images and texts typically results in performance degradation due to the inherent inconsistency between vision and language modalities. To overcome this challenge, existing approaches resort to modality-specific data for supervised fine-tuning, necessitating distinct model instances. We propose Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. Slide-LoRA harmonizes the generation of vision and language within a singular model instance, thereby facilitating a more unified generative process. Additionally, we develop a high-quality image caption dataset, DetailedTextCaps-100K, synthesized with a sophisticated closed-source MLLM to enhance visual text generation capabilities further. Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach. Empowered by Slide-LoRA, TextHarmony achieves comparable performance to modality-specific fine-tuning results with only a 2% increase in parameters and shows an average improvement of 2.5% in visual text comprehension tasks and 4.0% in visual text generation tasks. Our work delineates the viability of an integrated approach to multimodal generation within the visual text domain, setting a foundation for subsequent inquiries.
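The abstract does not detail Slide-LoRA's gating; the sketch below is a minimal, assumed interpretation on a single linear layer, mixing one modality-agnostic LoRA expert with gated modality-specific experts:

```python
# Minimal, assumed interpretation of gated LoRA experts on one linear layer;
# not TextHarmony's implementation.
import torch
import torch.nn as nn


class SlideLoRALinear(nn.Module):
    def __init__(self, dim: int, rank: int = 8, num_specific: int = 2):
        super().__init__()
        self.base = nn.Linear(dim, dim)                      # frozen base projection
        self.agnostic = nn.Sequential(                       # modality-agnostic expert
            nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim, bias=False))
        self.specific = nn.ModuleList(                       # e.g. image-gen vs. text-gen experts
            nn.Sequential(nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim, bias=False))
            for _ in range(num_specific))
        self.gate = nn.Linear(dim, num_specific)             # decides which expert to lean on

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, tokens, dim)
        g = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)  # (batch, num_specific)
        mixed = sum(g[:, i, None, None] * expert(x) for i, expert in enumerate(self.specific))
        return self.base(x) + self.agnostic(x) + mixed


layer = SlideLoRALinear(dim=64)
print(layer(torch.randn(2, 16, 64)).shape)                   # torch.Size([2, 16, 64])
```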
https://arxiv.org/abs/2407.16364
Avoiding systemic discrimination requires investigating AI models' potential to propagate stereotypes resulting from the inherent biases of training datasets. Our study investigated how text-to-image models unintentionally perpetuate non-rational beliefs regarding autism. The research protocol involved generating images based on 53 prompts aimed at visualizing concrete objects and abstract concepts related to autism across four models: DALL-E, Stable Diffusion, SDXL, and Midjourney (N=249). Expert assessment of the results was performed via a framework of 10 deductive codes representing common stereotypes contested by the community, rated for presence and spatial intensity on ordinal scales and subjected to statistical analysis of inter-rater reliability and effect sizes. The models frequently used controversial themes and symbols, which were unevenly distributed; however, the outputs were strikingly homogeneous in terms of skin colour, gender, and age, with autistic individuals portrayed as engaged in solitary activities, interacting with objects rather than people, and displaying stereotypical emotional expressions such as paleness, anger, or sadness. Second, we observed representational insensitivity in the autism images despite directional prompting aimed at falsifying the above results. Additionally, DALL-E explicitly denied perpetuating stereotypes. We interpret this as ANNs mirroring the human cognitive architecture regarding the discrepancy between background and reflective knowledge, as justified by our previous research on autism-related stereotypes in humans.
https://arxiv.org/abs/2407.16292
Virtual Try-On (VTON) has become a transformative technology, empowering users to experiment with fashion without ever having to physically try on clothing. However, existing methods often struggle with generating high-fidelity and detail-consistent results. While diffusion models, such as the Stable Diffusion series, have shown their capability in creating high-quality and photorealistic images, they encounter formidable challenges in conditional generation scenarios like VTON. Specifically, these models struggle to maintain a balance between control and consistency when generating images for virtual clothing trials. OutfitAnyone addresses these limitations by leveraging a two-stream conditional diffusion model, enabling it to adeptly handle garment deformation for more lifelike results. It distinguishes itself with scalability, modulating factors such as pose and body shape, and broad applicability, extending from anime to in-the-wild images. OutfitAnyone's performance in diverse scenarios underscores its utility and readiness for real-world deployment. For more details and animated results, please see \url{this https URL}.
https://arxiv.org/abs/2407.16224
Text-to-image models are enabling efficient design space exploration, rapidly generating images from text prompts. However, many generative AI tools are imperfect for product design applications as they are not built for the goals and requirements of product design. The unclear link between text input and image output further complicates their application. This work empirically investigates design space exploration strategies that can successfully yield product images that are feasible, novel, and aesthetic, which are three common goals in product design. Specifically, user actions within the global and local editing modes, including their time spent, prompt length, mono vs. multi-criteria prompts, and goal orientation of prompts, are analyzed. Key findings reveal the pivotal role of mono vs. multi-criteria and goal orientation of prompts in achieving specific design goals over time and prompt length. The study recommends prioritizing the use of multi-criteria prompts for feasibility and novelty during global editing, while favoring mono-criteria prompts for aesthetics during local editing. Overall, this paper underscores the nuanced relationship between the AI-driven text-to-image models and their effectiveness in product design, urging designers to carefully structure prompts during different editing modes to better meet the unique demands of product design.
https://arxiv.org/abs/2408.03946
Super-resolution reconstruction is aimed at generating images of high spatial resolution from low-resolution observations. State-of-the-art super-resolution techniques underpinned by deep learning allow for obtaining results of outstanding visual quality, but it is seldom verified whether they constitute a valuable source for specific computer vision applications. In this paper, we investigate the possibility of employing super-resolution as a preprocessing step to improve optical character recognition from document scans. To achieve that, we propose to train deep networks for single-image super-resolution in a task-driven way to make them better adapted for the purpose of text detection. As problems limited to a specific task are heavily ill-posed, we introduce a multi-task loss function that embraces components related to text detection coupled with those guided by image similarity. The obtained results reported in this paper are encouraging and they constitute an important step towards real-world super-resolution of document images.
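A minimal sketch of such a multi-task loss, with the text detector stubbed out and the weighting coefficients assumed (the paper's exact formulation is not reproduced here):

```python
# Minimal sketch of a task-driven multi-task SR loss; the detector is a stand-in and
# the weighting coefficients are assumptions, not the paper's formulation.
import torch
import torch.nn.functional as F


def detector_features(img: torch.Tensor) -> torch.Tensor:
    # Stand-in for a frozen text-detection network's feature or score maps.
    return F.avg_pool2d(img, kernel_size=4)


def task_driven_sr_loss(sr: torch.Tensor, hr: torch.Tensor,
                        w_sim: float = 1.0, w_det: float = 0.1) -> torch.Tensor:
    sim_term = F.l1_loss(sr, hr)                                         # image similarity
    det_term = F.mse_loss(detector_features(sr), detector_features(hr))  # detection consistency
    return w_sim * sim_term + w_det * det_term


sr = torch.rand(1, 3, 128, 128, requires_grad=True)   # super-resolved output (toy)
hr = torch.rand(1, 3, 128, 128)                        # ground-truth high-resolution image
loss = task_driven_sr_loss(sr, hr)
loss.backward()
print(float(loss))
```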
https://arxiv.org/abs/2407.08993
We introduce a new type of indirect injection vulnerabilities in language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and steer the model's outputs to express an adversary-chosen style, sentiment, or point of view. We explain how to create meta-instructions by generating images that act as soft prompts. Unlike jailbreaking attacks and adversarial examples, the outputs resulting from these images are plausible and based on the visual content of the image, yet follow the adversary's (meta-)instructions. We describe the risks of these attacks, including misinformation and spin, evaluate their efficacy for multiple visual language models and adversarial meta-objectives, and demonstrate how they can "unlock" the capabilities of the underlying language models that are unavailable via explicit text instructions. Finally, we discuss defenses against these attacks.
https://arxiv.org/abs/2407.08970
Text-to-image generative models have increasingly been used to assist designers during concept generation in various creative domains, such as graphic design, user interface design, and fashion design. However, their applications in engineering design remain limited due to the models' challenges in generating images of feasible design concepts. To address this issue, this paper introduces a method that improves design feasibility by prompting the generation with feasible CAD images. In this work, the usefulness of this method is investigated through a case study with a bike design task using an off-the-shelf text-to-image model, Stable Diffusion 2.1. A diverse set of bike designs is produced in seven different generation settings with varying CAD image prompting weights, and these designs are evaluated on their perceived feasibility and novelty. Results demonstrate that CAD image prompting successfully helps text-to-image models like Stable Diffusion 2.1 create visibly more feasible design images. While a general tradeoff is observed between feasibility and novelty, when the prompting weight is kept low around 0.35, the design feasibility is significantly improved while its novelty remains on par with that of designs generated by text prompts alone. The insights from this case study offer some guidelines for selecting the appropriate CAD image prompting weight for different stages of the engineering design process. When utilized effectively, our CAD image prompting method opens doors to a wider range of applications of text-to-image models in engineering design.
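The abstract does not specify how the CAD image is injected; one way to express a tunable image-prompt weight with off-the-shelf components (IP-Adapter on SD 1.5 here, an assumption rather than the paper's Stable Diffusion 2.1 setup) is:

```python
# One way to express a tunable image-prompt weight: IP-Adapter on SD 1.5.
# This is an illustrative assumption; the paper's own setup with Stable Diffusion 2.1
# is not described in the abstract. File paths are hypothetical.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.35)     # the low prompting weight the study found effective

cad_image = Image.open("bike_frame_cad_render.png").convert("RGB")
design = pipe(
    "a novel commuter bike design, product photo",
    ip_adapter_image=cad_image,
    num_inference_steps=30,
).images[0]
design.save("bike_concept.png")
```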
https://arxiv.org/abs/2407.08675
In this paper, we show different fine-tuning methods for Stable Diffusion XL, including inference steps and caption customization for each image, to align generated images with the style of a commercial 2D icon training set. We also show how important it is to properly define what "high-quality" really means, especially in a commercial-use environment. As generative AI models continue to gain widespread acceptance and usage, many different ways emerge to optimize and evaluate them for various applications. Specifically, text-to-image models such as Stable Diffusion XL and DALL-E 3 require distinct evaluation practices to effectively generate high-quality icons in a specific style. Although some images generated in a certain style may have a lower (better) FID score, we show that this is not absolute in and of itself, even for rasterized icons. While FID scores reflect the similarity of generated images to the overall training set, CLIP scores measure the alignment between generated images and their textual descriptions. We show how FID scores miss significant aspects, such as the minority of pixel differences that matter most in an icon, while CLIP scores can misjudge icon quality. The CLIP model's understanding of "similarity" is shaped by its own training data, which does not account for feature variation in our style of choice. Our findings highlight the need for specialized evaluation metrics and fine-tuning approaches when generating high-quality commercial icons, potentially leading to more effective and tailored applications of text-to-image models in professional design contexts.
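The two metrics being contrasted can be computed with torchmetrics; a minimal sketch on dummy image batches (model choices and captions are illustrative assumptions):

```python
# Minimal sketch of the two metrics being contrasted, on dummy uint8 image batches.
# Real evaluations would use the actual icon sets; feature=64 keeps the toy FID stable.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

real_icons = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
generated_icons = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=64)
fid.update(real_icons, real=True)
fid.update(generated_icons, real=False)
print("FID:", float(fid.compute()))          # similarity to the training set as a whole

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
captions = ["a flat 2D settings gear icon"] * 16      # hypothetical style captions
print("CLIP score:", float(clip_score(generated_icons, captions)))
```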
https://arxiv.org/abs/2407.08513
Diffusion models have demonstrated impressive image generation capabilities. Personalized approaches, such as textual inversion and Dreambooth, enhance model individualization using specific images. These methods enable generating images of specific objects based on diverse textual contexts. Our proposed approach aims to retain the model's original knowledge during new information integration, resulting in superior outcomes while necessitating less training time compared to Dreambooth and textual inversion.
https://arxiv.org/abs/2407.05312
This paper documents our characterization study and practices for serving text-to-image requests with stable diffusion models in production. We first comprehensively analyze inference request traces for commercial text-to-image applications. The analysis begins with our observation that add-on modules, i.e., ControlNets and LoRAs, which augment the base stable diffusion models, are ubiquitous in generating images for commercial applications. Despite their efficacy, these add-on modules incur high loading overhead, prolong the serving latency, and consume expensive GPU resources. Driven by our characterization study, we present SwiftDiffusion, a system that efficiently generates high-quality images using stable diffusion models and add-on modules. To achieve this, SwiftDiffusion reconstructs the existing text-to-image serving workflow by identifying opportunities for parallel computation and distributing ControlNet computations across multiple GPUs. Further, SwiftDiffusion thoroughly analyzes the dynamics of image generation and develops techniques to eliminate the overhead associated with LoRA loading and patching while preserving image quality. Last, SwiftDiffusion proposes specialized optimizations in the backbone architecture of the stable diffusion models, which are also compatible with the efficient serving of add-on modules. Compared to state-of-the-art text-to-image serving systems, SwiftDiffusion reduces serving latency by up to 5x and improves serving throughput by up to 2x without compromising image quality.
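The add-on loading overhead that motivates SwiftDiffusion can be observed directly in a stock diffusers setup; a small timing sketch (the checkpoint and LoRA repository names are placeholders, and this is not SwiftDiffusion's code):

```python
# Small timing sketch of the LoRA load/patch overhead the paper targets; this is not
# SwiftDiffusion. The checkpoint and the LoRA repository name are placeholders.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

t0 = time.perf_counter()
pipe.load_lora_weights("some-org/some-style-lora")   # hypothetical LoRA repository
t1 = time.perf_counter()
image = pipe("a watercolor landscape", num_inference_steps=30).images[0]
t2 = time.perf_counter()
pipe.unload_lora_weights()

print(f"LoRA load/patch: {t1 - t0:.2f}s, generation: {t2 - t1:.2f}s")
```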
https://arxiv.org/abs/2407.02031
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves state-of-the-art performance in consistent character generation, with an improvement of 24.8% compared with other methods.
https://arxiv.org/abs/2406.16537
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. However, the advancement of T2I diffusion models presents significant risks, as the models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. To mitigate these risks, concept removal methods have been proposed. These methods aim to modify diffusion models to prevent the generation of malicious and unwanted concepts. Despite these efforts, existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric. In this benchmark, we conduct a thorough evaluation of concept removals, with the experimental observations and discussions offering valuable insights in the field.
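The benchmark's core check, whether a removed concept still appears while the benign part of the prompt survives, can be sketched with a CLIP zero-shot probe; the concept-removed checkpoint, prompt, and labels below are assumptions, not Six-CD's actual protocol:

```python
# Sketch of a concept-removal check, not Six-CD's code: generate from a prompt that mixes
# a removed concept with benign content, then use CLIP zero-shot scores to test whether
# the concept appears and the benign part survives. The edited checkpoint is hypothetical.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "some-org/sd-with-concept-removed", torch_dtype=torch.float16   # hypothetical edited model
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a violent scene in a quiet park"          # malicious concept + benign context
image = pipe(prompt, num_inference_steps=30).images[0]

labels = ["a violent scene", "a quiet park", "an unrelated image"]
inputs = proc(text=labels, images=image, return_tensors="pt", padding=True)
probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]

# Desired outcome: low probability on the removed concept, high on the benign content.
print({label: round(float(p), 3) for label, p in zip(labels, probs)})
```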
https://arxiv.org/abs/2406.14855
Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driven mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.
https://arxiv.org/abs/2406.10553
We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons' function in CLIP. Therefore, we present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. We find that these effects are highly selective: for each neuron, the effect is significant for <2% of the images. Moreover, each effect can be approximated by a single direction in the text-image space of CLIP. We describe neurons by decomposing these directions into sparse sets of text representations. The sets reveal polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars). Exploiting this neuron polysemy, we mass-produce "semantic" adversarial examples by generating images with concepts spuriously correlated to the incorrect class. Additionally, we use the second-order effects for zero-shot segmentation and attribute discovery in images. Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
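The decomposition of a direction into a sparse set of text representations can be sketched as greedy matching pursuit over CLIP text embeddings; the neuron direction below is a random stand-in, since obtaining the real one requires the paper's second-order attention analysis:

```python
# Sketch of decomposing a joint-space direction into a sparse set of text embeddings via
# greedy matching pursuit; not the paper's exact procedure. The neuron direction is a
# random stand-in, and the text pool is a tiny illustrative vocabulary.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

pool = ["a ship", "a car", "a dog", "a forest", "a skyscraper", "a sailboat"]
with torch.no_grad():
    text_emb = clip.get_text_features(**proc(text=pool, return_tensors="pt", padding=True))
text_emb = F.normalize(text_emb, dim=-1)

direction = F.normalize(torch.randn(text_emb.shape[-1]), dim=0)   # stand-in neuron direction

residual, chosen = direction.clone(), []
for _ in range(3):                                 # keep the decomposition sparse
    scores = text_emb @ residual
    idx = int(scores.abs().argmax())
    chosen.append((pool[idx], round(float(scores[idx]), 3)))
    residual = residual - scores[idx] * text_emb[idx]

print(chosen)   # for a real neuron direction, the paper finds mixes of unrelated concepts
```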
https://arxiv.org/abs/2406.04341