Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text inputs, for example producing incorrect spatial relationships or missing objects. In this paper, we introduce SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data, a novel paradigm to improve the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets, with skill-specific expert learning and merging. First, SELMA leverages an LLM's in-context learning capability to generate multiple datasets of text prompts that can teach different skills, and then generates the images with a T2I model based on the prompts. Next, SELMA adapts the T2I model to the new skills by learning multiple single-skill LoRA (low-rank adaptation) experts followed by expert merging. Our independent expert fine-tuning specializes multiple models for different skills, and expert merging helps build a joint multi-skill T2I model that can generate faithful images given diverse text prompts, while mitigating the knowledge conflict from different datasets. We empirically demonstrate that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human preference metrics (PickScore, ImageReward, and HPS), as well as human evaluation. Moreover, fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data. Lastly, we show that fine-tuning with images from a weaker T2I model can help improve the generation quality of a stronger T2I model, suggesting promising weak-to-strong generalization in T2I models.
https://arxiv.org/abs/2403.06952
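As a rough illustration of the expert-merging step, the PyTorch sketch below merges independently trained skill-specific LoRA experts by uniformly averaging their low-rank updates before adding them to a base weight matrix. The function and variable names are placeholders, and SELMA's actual merging recipe may weight or compose the experts differently.

    import torch

    def merge_lora_experts(base_weight, experts, alpha=1.0):
        """Merge skill-specific LoRA experts into a single weight matrix.

        Each expert is an (A, B) pair with A of shape (rank, in_dim) and
        B of shape (out_dim, rank), so its update is B @ A. Here the
        experts are merged by uniform averaging of their updates.
        """
        delta = torch.zeros_like(base_weight)
        for A, B in experts:
            delta += B @ A
        delta /= len(experts)
        return base_weight + alpha * delta

    # Toy usage: one linear layer of a T2I UNet and two single-skill experts.
    out_dim, in_dim, rank = 64, 32, 4
    base = torch.randn(out_dim, in_dim)
    experts = [(0.01 * torch.randn(rank, in_dim), 0.01 * torch.randn(out_dim, rank))
               for _ in range(2)]
    merged = merge_lora_experts(base, experts)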
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. To further improve T2I models' capability in numerical and spatial reasoning, the layout is employed as an intermediary to bridge large language models and layout-based diffusion models. However, these methods still struggle with generating images from textual prompts with multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks. Our approach divides the layout prediction stage into numerical & spatial reasoning and bounding box prediction. Then, the layout-to-image generation stage is conducted in an iterative manner to reconstruct objects from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K benchmarks and our approach outperforms previous state-of-the-art models with notable margins. In addition, visual results demonstrate that our approach significantly improves the controllability and consistency in generating multiple objects from complex textual prompts.
https://arxiv.org/abs/2403.06400
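To make the decomposition concrete, here is a toy Python sketch of the described split: a numerical-and-spatial-reasoning step that lists objects, counts, and relations (an LLM call in the paper), a bounding-box prediction step, and an iterative layout-to-image stage that handles easier objects before harder ones. Every function below is a hard-coded placeholder rather than the paper's actual models.

    from dataclasses import dataclass

    @dataclass
    class ObjectSpec:
        name: str
        count: int
        relation: str        # e.g. "left of cat"; empty if unconstrained
        difficulty: float    # orders the iterative layout-to-image stage

    def reason_counts_and_relations(prompt):
        # Subtask 1: numerical & spatial reasoning (an LLM call in practice).
        # Hard-coded toy output for "two dogs to the left of a cat".
        return [ObjectSpec("cat", 1, "", 0.2), ObjectSpec("dog", 2, "left of cat", 0.6)]

    def predict_boxes(spec, placed):
        # Subtask 2: bounding-box prediction, conditioned on already-placed objects.
        x_right = 0.9 - 0.3 * len(placed)            # toy rule; a model predicts these
        return [(x_right - 0.25 * (i + 1), 0.4, x_right - 0.25 * i, 0.8)
                for i in range(spec.count)]

    prompt = "two dogs to the left of a cat"
    specs = reason_counts_and_relations(prompt)
    placed = {}
    for spec in specs:
        placed[spec.name] = predict_boxes(spec, placed)

    # The layout-to-image model would now reconstruct objects iteratively,
    # starting from the easiest ones in the predicted layout.
    for spec in sorted(specs, key=lambda s: s.difficulty):
        print(f"render {spec.count} x {spec.name} ({spec.relation or 'free'}): {placed[spec.name]}")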
The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper, we introduce MACE, a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast, MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning, collectively eliminating the information of undesirable concepts. Furthermore, MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure, celebrity erasure, explicit content erasure, and artistic style erasure. Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at this https URL.
https://arxiv.org/abs/2403.06135
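The closed-form cross-attention refinement can be illustrated with a small ridge-regression-style update: re-fit a cross-attention key/value projection so that token embeddings of the unwanted concept map to what a neutral anchor phrase would produce, while embeddings that should be preserved keep their original outputs. The sketch below is a generic instance of that family of closed-form edits, not MACE's exact objective, and the LoRA stage is omitted.

    import torch

    def refine_projection(W, erase_embs, anchor_embs, preserve_embs, lam=0.1):
        """Closed-form refinement of a cross-attention K/V projection W (out_dim, in_dim).

        Minimizes ||W' e_i - W a_i||^2 over erased/anchor embedding pairs (e_i, a_i),
        plus ||W' p_j - W p_j||^2 over preserved embeddings p_j, plus lam * ||W' - W||^2,
        which admits the ridge-regression-style solution below.
        """
        X = torch.cat([erase_embs, preserve_embs], dim=0)                # (n, in_dim)
        Y = torch.cat([anchor_embs @ W.T, preserve_embs @ W.T], dim=0)   # (n, out_dim)
        A = X.T @ X + lam * torch.eye(W.shape[1])
        B = X.T @ Y + lam * W.T
        return torch.linalg.solve(A, B).T

    # Toy usage with random vectors standing in for CLIP token embeddings.
    in_dim, out_dim = 16, 8
    W = torch.randn(out_dim, in_dim)
    W_refined = refine_projection(W,
                                  erase_embs=torch.randn(4, in_dim),
                                  anchor_embs=torch.randn(4, in_dim),
                                  preserve_embs=torch.randn(32, in_dim))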
Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the process by which the encoder produces the text representation is unknown. We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the Diffusion Lens, we perform an extensive analysis of two recent T2I models. Exploring compound prompts, we find that complex scenes describing multiple objects are composed progressively and more slowly compared to simple scenes; Exploring knowledge retrieval, we find that representation of uncommon concepts requires further computation compared to common concepts, and that knowledge retrieval is gradual across layers. Overall, our findings provide valuable insights into the text encoder component in T2I pipelines.
https://arxiv.org/abs/2403.05846
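The core mechanism, generating an image from an intermediate text-encoder representation, can be approximated with the diffusers library as below: take the hidden state after a chosen encoder layer, re-apply the encoder's final layer norm, and pass it to the pipeline as prompt embeddings. The checkpoint name and layer index are placeholders, and the paper's exact handling of intermediate states may differ.

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

    prompt = "a blue bicycle leaning against a red brick wall"
    layer = 6  # which intermediate text-encoder layer to look through

    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        enc = pipe.text_encoder(tokens.input_ids, output_hidden_states=True)
        # Hidden state after `layer` blocks, passed through the final layer norm so it
        # lives in the space the UNet's cross-attention was trained on.
        embeds = pipe.text_encoder.text_model.final_layer_norm(enc.hidden_states[layer])

    image = pipe(prompt_embeds=embeds, num_inference_steps=30).images[0]
    image.save(f"diffusion_lens_layer_{layer:02d}.png")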
In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
https://arxiv.org/abs/2403.04692
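The key-and-value token compression can be sketched as a self-attention block whose keys and values are computed on a spatially downsampled copy of the token grid, so attention cost scales with the compressed token count while queries stay at full resolution. This is a generic PyTorch sketch; the compression operator PixArt-Σ actually uses, and where it sits inside the DiT block, may differ.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class KVCompressedAttention(nn.Module):
        """Self-attention whose keys and values come from a downsampled token grid."""

        def __init__(self, dim, num_heads=8, ratio=2):
            super().__init__()
            self.num_heads, self.ratio = num_heads, ratio
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, 2 * dim)
            self.proj = nn.Linear(dim, dim)
            # Strided depthwise conv that merges ratio x ratio groups of K/V tokens.
            self.compress = nn.Conv2d(dim, dim, kernel_size=ratio, stride=ratio, groups=dim)

        def forward(self, x, h, w):
            b, n, d = x.shape                                   # n == h * w image tokens
            q = self.q(x)
            grid = x.transpose(1, 2).reshape(b, d, h, w)
            kv_tokens = self.compress(grid).flatten(2).transpose(1, 2)   # (b, n / ratio^2, d)
            k, v = self.kv(kv_tokens).chunk(2, dim=-1)

            def heads(t):   # (b, tokens, d) -> (b, heads, tokens, d_head)
                return t.reshape(b, -1, self.num_heads, d // self.num_heads).transpose(1, 2)

            out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
            return self.proj(out.transpose(1, 2).reshape(b, n, d))

    # Toy usage on a 32x32 grid of latent tokens: 1024 query tokens attend to 256 compressed K/V tokens.
    attn = KVCompressedAttention(dim=64)
    tokens = torch.randn(1, 32 * 32, 64)
    print(attn(tokens, 32, 32).shape)   # torch.Size([1, 1024, 64])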
Recent advancement in text-to-image models (e.g., Stable Diffusion) and corresponding personalized technologies (e.g., DreamBooth and LoRA) enables individuals to generate high-quality and imaginative images. However, they often suffer from limitations when generating images with resolutions outside of their trained domain. To overcome this limitation, we present the Resolution Adapter (ResAdapter), a domain-consistent adapter designed for diffusion models to generate images with unrestricted resolutions and aspect ratios. Unlike other multi-resolution generation methods that process images of static resolution with complex post-process operations, ResAdapter directly generates images with dynamic resolution. In particular, after learning a deep understanding of pure resolution priors, ResAdapter, trained on a general dataset, generates resolution-free images with personalized diffusion models while preserving their original style domain. Comprehensive experiments demonstrate that ResAdapter with only 0.5M parameters can process images with flexible resolutions for arbitrary diffusion models. More extended experiments demonstrate that ResAdapter is compatible with other modules (e.g., ControlNet, IP-Adapter and LCM-LoRA) for image generation across a broad range of resolutions, and can be integrated into other multi-resolution models (e.g., ElasticDiffusion) for efficiently generating higher-resolution images. Project link is this https URL
https://arxiv.org/abs/2403.02084
Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands.
https://arxiv.org/abs/2403.01693
The class-conditional image generation based on diffusion models is renowned for generating high-quality and diverse images. However, most prior efforts focus on generating images for general categories, e.g., 1000 classes in ImageNet-1k. A more challenging task, large-scale fine-grained image generation, remains largely unexplored. In this work, we present a parameter-efficient strategy, called FineDiffusion, that fine-tunes large pre-trained diffusion models and scales them to large-scale fine-grained image generation with 10,000 categories. FineDiffusion significantly accelerates training and reduces storage overhead by fine-tuning only the tiered class embedder, bias terms, and normalization layers' parameters. To further improve the image generation quality of fine-grained categories, we propose a novel sampling method for fine-grained image generation, which utilizes superclass-conditioned guidance, specifically tailored for fine-grained categories, to replace the conventional classifier-free guidance sampling. Compared to full fine-tuning, FineDiffusion achieves a remarkable 1.56x training speed-up and requires storing merely 1.77% of the total model parameters, while achieving a state-of-the-art FID of 9.776 on 10,000-class image generation. Extensive qualitative and quantitative experiments demonstrate the superiority of our method compared to other parameter-efficient fine-tuning methods. The code and more generated results are available at our project website: this https URL.
https://arxiv.org/abs/2402.18331
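Two pieces of the recipe are easy to sketch: selecting only class-embedder, bias, and normalization parameters for fine-tuning, and the superclass-conditioned guidance that replaces classifier-free guidance at sampling time (guide the fine-grained prediction away from its superclass prediction rather than away from an unconditional one). The parameter-name matching and the toy denoiser below are stand-ins, not the paper's implementation.

    import torch
    import torch.nn as nn

    def mark_finediffusion_trainable(model):
        """Freeze everything except bias terms, normalization layers, and class embedders.

        Name matching is a heuristic stand-in for selecting the paper's tiered
        class embedder plus norm/bias parameters.
        """
        for name, p in model.named_parameters():
            p.requires_grad_("bias" in name or "norm" in name or "class_emb" in name)

    @torch.no_grad()
    def superclass_guided_eps(model, x_t, t, cls_id, supercls_id, scale=4.0):
        """Superclass-conditioned guidance in place of classifier-free guidance."""
        eps_fine = model(x_t, t, class_labels=cls_id)
        eps_super = model(x_t, t, class_labels=supercls_id)
        return eps_super + scale * (eps_fine - eps_super)

    # Toy usage with a stand-in denoiser; a real DiT/UNet would take its place.
    class ToyDenoiser(nn.Module):
        def __init__(self):
            super().__init__()
            self.class_emb = nn.Embedding(10_000, 8)
            self.norm = nn.LayerNorm(8)
            self.out = nn.Linear(8, 8)

        def forward(self, x_t, t, class_labels):
            return self.out(self.norm(x_t + self.class_emb(class_labels)))

    net = ToyDenoiser()
    mark_finediffusion_trainable(net)
    eps = superclass_guided_eps(net, torch.randn(2, 8), t=None,
                                cls_id=torch.tensor([5, 7]), supercls_id=torch.tensor([1, 1]))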
Precision devices play an important role in enhancing production quality and productivity in agricultural systems. Therefore, the optimization of these devices is essential in precision agriculture. Recently, with the advancements of deep learning, there have been several studies aiming to harness its capabilities for improving spray system performance. However, the effectiveness of these methods heavily depends on the size of the training dataset, which is expensive and time-consuming to collect. To address the challenge of insufficient training samples, this paper proposes an alternative solution by generating artificial images of droplets using generative adversarial networks (GAN). The GAN model is trained by using a small dataset captured by a high-speed camera and capable of generating images with progressively increasing resolution. The results demonstrate that the model can generate high-quality images at a resolution of 1024x1024. Furthermore, this research leverages recent advancements in computer vision and deep learning to develop a lightweight droplet detector using the synthetic dataset. As a result, the detection model achieves a 16.06% increase in mean average precision (mAP) when utilizing the synthetic dataset. To the best of our knowledge, this work stands as the first to employ a generative model for augmenting droplet detection. Its significance lies not only in optimizing nozzle design for constructing efficient spray systems but also in addressing the common challenge of insufficient data in various precision agriculture tasks. This work offers a critical contribution to conserving resources while striving for optimal and sustainable agricultural practices.
https://arxiv.org/abs/2402.15909
Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5x faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at this https URL.
https://arxiv.org/abs/2402.14797
Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at this https URL.
https://arxiv.org/abs/2402.12376
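The "images as sequences of dynamically-sized tokens" idea can be sketched as a patchify step that turns an image of any resolution or aspect ratio into a variable-length token sequence with explicit 2D positions, plus padding and an attention mask so differently-shaped images share a batch. The patch size and packing scheme below are illustrative choices, not FiT's exact implementation.

    import torch

    def patchify(img, patch=16):
        """Turn a (C, H, W) image of arbitrary size into tokens plus (row, col) positions."""
        c, h, w = img.shape
        gh, gw = h // patch, w // patch
        x = img[:, :gh * patch, :gw * patch].reshape(c, gh, patch, gw, patch)
        tokens = x.permute(1, 3, 0, 2, 4).reshape(gh * gw, c * patch * patch)
        pos = torch.stack(torch.meshgrid(torch.arange(gh), torch.arange(gw),
                                         indexing="ij"), dim=-1).reshape(-1, 2)
        return tokens, pos

    def pack(seqs, max_len):
        """Pad variable-length token sequences and return an attention mask."""
        out = torch.zeros(len(seqs), max_len, seqs[0].shape[1])
        mask = torch.zeros(len(seqs), max_len, dtype=torch.bool)
        for i, s in enumerate(seqs):
            out[i, :len(s)] = s
            mask[i, :len(s)] = True
        return out, mask

    # Two aspect ratios, one batch: 384 tokens (16x24 grid) and 240 tokens (20x12 grid).
    a, pos_a = patchify(torch.randn(3, 256, 384))
    b, pos_b = patchify(torch.randn(3, 320, 192))
    batch, mask = pack([a, b], max_len=max(len(a), len(b)))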
Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. However, these models are vulnerable to unsafe inputs that steer them toward generating unsafe content such as sexually explicit, harassment, and illegal-activity images. Existing approaches based on image checkers, model fine-tuning, and embedding blocking are impractical in real-world applications. Hence, we propose the first universal prompt optimizer for safe T2I generation in the black-box scenario. We first construct a dataset consisting of toxic-clean prompt pairs generated by GPT-3.5 Turbo. To guide the optimizer to convert toxic prompts into clean prompts while preserving semantic information, we design a novel reward function measuring the toxicity and text alignment of generated images and train the optimizer through Proximal Policy Optimization. Experiments show that our approach can effectively reduce the likelihood of various T2I models generating inappropriate images, with no significant impact on text alignment. It can also be flexibly combined with other methods to achieve better performance.
https://arxiv.org/abs/2402.10882
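A reward of the described form can be sketched as a combination of a text-alignment score between the generated image and the user's original prompt and a toxicity penalty on the image; this scalar would then drive PPO updates of the prompt optimizer. CLIP alignment below uses the public openai/clip-vit-base-patch32 checkpoint; the toxicity scorer is a placeholder stub, and the weights are illustrative rather than the paper's.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def toxicity(image):
        # Placeholder for an image safety scorer (e.g. an NSFW classifier in [0, 1]);
        # the paper's exact toxicity measurement is not reproduced here.
        return 0.0

    @torch.no_grad()
    def reward(image, original_prompt, w_align=1.0, w_tox=1.0):
        """Reward = CLIP alignment with the user's original prompt minus a toxicity penalty."""
        inputs = proc(text=[original_prompt], images=image, return_tensors="pt", padding=True)
        out = clip(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return w_align * (img * txt).sum().item() - w_tox * toxicity(image)

    # Toy call; in training, `image` would come from the T2I model run on the optimized prompt.
    print(reward(Image.new("RGB", (512, 512), "white"), "a watercolor painting of a lighthouse"))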
Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
https://arxiv.org/abs/2402.10491
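The basic mechanism of reusing a well-trained low-resolution model's output as the starting point for higher-resolution synthesis can be sketched with the standard forward-noising formula: upsample the low-resolution sample and re-noise it to an intermediate step, then continue denoising at the new resolution. The paper's pivot-guided noise re-schedule and tuned multi-scale upsampler modules go beyond this plain sketch.

    import torch
    import torch.nn.functional as F

    def pivot_renoise(x0_low, alphas_cumprod, start_step, scale=2):
        """Upsample a low-resolution sample and re-noise it to an intermediate timestep,
        giving a pivot from which higher-resolution denoising can resume."""
        x0_up = F.interpolate(x0_low, scale_factor=scale, mode="bicubic")
        a_bar = alphas_cumprod[start_step]
        return a_bar.sqrt() * x0_up + (1.0 - a_bar).sqrt() * torch.randn_like(x0_up)

    # Toy usage: a 64x64 latent promoted to 128x128 and re-noised to step 600 of 1000.
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cumprod = (1.0 - betas).cumprod(dim=0)
    x_pivot = pivot_renoise(torch.randn(1, 4, 64, 64), alphas_cumprod, start_step=600)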
Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet, these models are not robust in precisely reasoning about physical and spatial configurations of objects, especially when instructed with unconventional, and thereby out-of-distribution, descriptions such as "a chair with five legs". In this paper, we propose a language agent with chain-of-3D-thoughts (L3GO), an inference-time approach that can reason about part-based 3D mesh generation of unconventional objects that current data-driven diffusion models struggle with. More concretely, we use large language models as agents to compose a desired object via trial-and-error within the 3D simulation environment. To facilitate our investigation, we develop a new benchmark, Unconventionally Feasible Objects (UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender where language agents can build and compose atomic building blocks via API calls. Human and automatic GPT-4V evaluations show that our approach surpasses the standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our approach outperforms other state-of-the-art text-to-2D image and text-to-3D models based on human evaluation.
https://arxiv.org/abs/2402.09052
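A stripped-down version of the agent loop can be run inside Blender's bundled Python (the bpy module ships with Blender): a propose-next-part function, which in L3GO would be an LLM call reasoning over feedback from the scene, returns primitive parts that are then placed via API calls. The hard-coded plan below just builds the abstract's "chair with five legs" example and is not the paper's SimpleBlenv interface.

    # Runs inside Blender's bundled Python (the bpy module ships with Blender).
    import bpy

    def propose_next_part(goal, parts_so_far):
        """Stand-in for the language agent: in L3GO an LLM reasons about which part
        to add next (and where), using feedback from the scene built so far."""
        plan = [{"type": "cylinder", "radius": 0.05, "depth": 1.0, "location": (x, y, 0.5)}
                for x, y in [(-0.4, -0.4), (0.4, -0.4), (-0.4, 0.4), (0.4, 0.4), (0.0, 0.0)]]
        plan.append({"type": "cube", "size": 1.0, "location": (0.0, 0.0, 1.05),
                     "scale": (1.0, 1.0, 0.1)})
        return plan[len(parts_so_far)] if len(parts_so_far) < len(plan) else None

    parts = []
    while (part := propose_next_part("a chair with five legs", parts)) is not None:
        if part["type"] == "cylinder":
            bpy.ops.mesh.primitive_cylinder_add(radius=part["radius"], depth=part["depth"],
                                                location=part["location"])
        else:
            bpy.ops.mesh.primitive_cube_add(size=part["size"], location=part["location"])
            bpy.context.object.scale = part["scale"]
        parts.append(part)  # the agent could render and critique here before the next step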
Guided image synthesis methods, like SDEdit based on the diffusion model, excel at creating realistic images from user inputs such as stroke paintings. However, existing efforts mainly focus on image quality, often overlooking a key point: the diffusion model represents a data distribution, not individual images. This introduces a low but critical chance of generating images that contradict user intentions, raising ethical concerns. For example, a user inputting a stroke painting with female characteristics might, with some probability, get male faces from SDEdit. To expose this potential vulnerability, we aim to build an adversarial attack forcing SDEdit to generate a specific data distribution aligned with a specified attribute (e.g., female), without changing the input's attribute characteristics. We propose the Targeted Attribute Generative Attack (TAGA), using an attribute-aware objective function and optimizing the adversarial noise added to the input stroke painting. Empirical studies reveal that traditional adversarial noise struggles with TAGA, while natural perturbations like exposure and motion blur easily alter generated images' attributes. To execute effective attacks, we introduce FoolSDEdit: We design a joint adversarial exposure and blur attack, adding exposure and motion blur to the stroke painting and optimizing them together. We optimize the execution strategy of various perturbations, framing it as a network architecture search problem. We create the SuperPert, a graph representing diverse execution strategies for different perturbations. After training, we obtain the optimized execution strategy for effective TAGA against SDEdit. Comprehensive experiments on two datasets show our method compelling SDEdit to generate a targeted attribute-aware data distribution, significantly outperforming baselines.
https://arxiv.org/abs/2402.03705
Artistic video portrait generation is a significant and sought-after task in the fields of computer graphics and vision. While various methods have been developed that integrate NeRFs or StyleGANs with instructional editing models for creating and editing drivable portraits, these approaches face several challenges. They often rely heavily on large datasets, require extensive customization processes, and frequently result in reduced image quality. To address the above problems, we propose the Efficient Monotonic Video Style Avatar (Emo-Avatar), which uses deferred neural rendering to enhance StyleGAN's capacity for producing dynamic, drivable portrait videos. We propose a two-stage deferred neural rendering pipeline. In the first stage, we utilize few-shot PTI initialization to initialize the StyleGAN generator with several extreme poses sampled from the video, capturing a consistent representation of aligned faces from the target portrait. In the second stage, we propose a Laplacian pyramid for sampling high-frequency textures from UV maps deformed by the dynamic expression flow, integrating a motion-aware texture prior that supplies torso features and enhances StyleGAN's ability to render complete upper-body portrait videos. Emo-Avatar reduces style customization time from hours to merely 5 minutes compared with existing methods. In addition, Emo-Avatar requires only a single reference image for editing and employs region-aware contrastive learning with semantic-invariant CLIP guidance, ensuring consistent high-resolution output and identity preservation. Through both quantitative and qualitative assessments, Emo-Avatar demonstrates superior performance over existing methods in terms of training efficiency, rendering quality and editability in self- and cross-reenactment.
https://arxiv.org/abs/2402.00827
The multifaceted nature of human perception and comprehension indicates that, when we think, our body can naturally take any combination of senses, a.k.a., modalities and form a beautiful picture in our brain. For example, when we see a cattery and simultaneously perceive the cat's purring sound, our brain can construct a picture of a cat in the cattery. Intuitively, generative AI models should hold the versatility of humans and be capable of generating images from any combination of modalities efficiently and collaboratively. This paper presents ImgAny, a novel end-to-end multi-modal generative model that can mimic human reasoning and generate high-quality images. Our method serves as the first attempt in its capacity of efficiently and flexibly taking any combination of seven modalities, ranging from language, audio to vision modalities, including image, point cloud, thermal, depth, and event data. Our key idea is inspired by human-level cognitive processes and involves the integration and harmonization of multiple input modalities at both the entity and attribute levels without specific tuning across modalities. Accordingly, our method brings two novel training-free technical branches: 1) Entity Fusion Branch ensures the coherence between inputs and outputs. It extracts entity features from the multi-modal representations powered by our specially constructed entity knowledge graph; 2) Attribute Fusion Branch adeptly preserves and processes the attributes. It efficiently amalgamates distinct attributes from diverse input modalities via our proposed attribute knowledge graph. Lastly, the entity and attribute features are adaptively fused as the conditional inputs to the pre-trained Stable Diffusion model for image generation. Extensive experiments under diverse modality combinations demonstrate its exceptional capability for visual content creation.
https://arxiv.org/abs/2401.17664
In contemporary design practices, the integration of computer vision and generative artificial intelligence (genAI) represents a transformative shift towards more interactive and inclusive processes. These technologies offer new dimensions of image analysis and generation, which are particularly relevant in the context of urban landscape reconstruction. This paper presents a novel workflow encapsulated within a prototype application, designed to leverage the synergies between advanced image segmentation and diffusion models for a comprehensive approach to urban design. Our methodology encompasses the OneFormer model for detailed image segmentation and the Stable Diffusion XL (SDXL) diffusion model, implemented through ControlNet, for generating images from textual descriptions. Validation results indicated a high degree of performance by the prototype application, showcasing significant accuracy in both object detection and text-to-image generation. This was evidenced by superior Intersection over Union (IoU) and CLIP scores across iterative evaluations for various categories of urban landscape features. Preliminary testing included utilising UrbanGenAI as an educational tool enhancing the learning experience in design pedagogy, and as a participatory instrument facilitating community-driven urban planning. Early results suggested that UrbanGenAI not only advances the technical frontiers of urban landscape reconstruction but also provides significant pedagogical and participatory planning benefits. The ongoing development of UrbanGenAI aims to further validate its effectiveness across broader contexts and integrate additional features such as real-time feedback mechanisms and 3D modelling capabilities. Keywords: generative AI; panoptic image segmentation; diffusion models; urban landscape design; design pedagogy; co-design
https://arxiv.org/abs/2401.14379
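The described OneFormer-plus-SDXL-ControlNet wiring can be approximated with the Hugging Face transformers and diffusers libraries as below. The OneFormer checkpoint is a real public one; the segmentation-ControlNet path for SDXL, the input photo path, the color palette, and the prompt are placeholders, and the prototype application's actual pre/post-processing is not reproduced.

    import numpy as np
    import torch
    from PIL import Image
    from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

    # 1) Semantic segmentation of the site photo with OneFormer.
    seg_proc = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
    seg_model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")

    site_photo = Image.open("site_photo.jpg").convert("RGB")   # placeholder input path
    inputs = seg_proc(images=site_photo, task_inputs=["semantic"], return_tensors="pt")
    with torch.no_grad():
        seg = seg_proc.post_process_semantic_segmentation(
            seg_model(**inputs), target_sizes=[site_photo.size[::-1]])[0]

    # Color-code the label map (a proper ADE20K palette should be used in practice).
    palette = np.random.default_rng(0).integers(0, 255, size=(256, 3), dtype=np.uint8)
    seg_image = Image.fromarray(palette[seg.numpy().astype(np.uint8)])

    # 2) Text-guided regeneration with SDXL conditioned on the segmentation via ControlNet.
    # The checkpoint path below is a placeholder for a segmentation-conditioned SDXL ControlNet.
    controlnet = ControlNetModel.from_pretrained("path/to/segmentation-controlnet-sdxl",
                                                 torch_dtype=torch.float16)
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")

    result = pipe("a pedestrian-friendly square with trees, benches and a water feature",
                  image=seg_image, controlnet_conditioning_scale=0.8).images[0]
    result.save("urban_proposal.png")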
Advancements in generative models have sparked significant interest in generating images while adhering to specific structural guidelines. Scene graph to image generation is one such task of generating images which are consistent with the given scene graph. However, the complexity of visual scenes poses a challenge in accurately aligning objects based on specified relations within the scene graph. Existing methods approach this task by first predicting a scene layout and generating images from these layouts using adversarial training. In this work, we introduce a novel approach to generate images from scene graphs which eliminates the need of predicting intermediate layouts. We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images. Towards this, we first pre-train our graph encoder to align graph features with CLIP features of corresponding images using a GAN based training. Further, we fuse the graph features with CLIP embedding of object labels present in the given scene graph to create a graph consistent CLIP guided conditioning signal. In the conditioning input, object embeddings provide coarse structure of the image and graph features provide structural alignment based on relationships among objects. Finally, we fine tune a pre-trained diffusion model with the graph consistent conditioning signal with reconstruction and CLIP alignment loss. Elaborate experiments reveal that our method outperforms existing methods on standard benchmarks of COCO-stuff and Visual Genome dataset.
https://arxiv.org/abs/2401.14111
Recent text-to-image generation models have demonstrated incredible success in generating images that faithfully follow input prompts. However, the requirement of using words to describe a desired concept provides limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach to enable personalization capabilities in existing text-to-image diffusion models. We propose a novel architecture (BootPIG) that allows a user to provide reference images of an object in order to guide the appearance of a concept in the generated images. The proposed BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model and utilizes a separate UNet model to steer the generations toward the desired appearance. We introduce a training procedure that allows us to bootstrap personalization capabilities in the BootPIG architecture using data generated from pretrained text-to-image models, LLM chat agents, and image segmentation models. In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while being comparable with test-time finetuning approaches. Through a user study, we validate the preference for BootPIG generations over existing methods both in maintaining fidelity to the reference object's appearance and aligning with textual prompts.
https://arxiv.org/abs/2401.13974
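The "separate UNet steering the generation" idea belongs to the family of reference-feature injection: features extracted by a reference UNet from the subject images are appended to the keys and values of the base UNet's self-attention so generated tokens can attend to them. The module below is a generic PyTorch sketch of that mechanism, not BootPIG's exact architecture or training procedure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReferenceInjectedSelfAttention(nn.Module):
        """Self-attention that also attends to tokens produced by a reference UNet pass."""

        def __init__(self, dim, heads=8):
            super().__init__()
            self.heads = heads
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x, ref):
            b, n, d = x.shape
            ctx = torch.cat([x, ref], dim=1)          # own tokens + reference tokens
            q, k, v = self.to_q(x), self.to_k(ctx), self.to_v(ctx)

            def heads(t):   # (b, tokens, d) -> (b, heads, tokens, d_head)
                return t.reshape(b, -1, self.heads, d // self.heads).transpose(1, 2)

            out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
            return self.proj(out.transpose(1, 2).reshape(b, n, d))

    # Toy usage: generation-UNet tokens attend to same-resolution reference-UNet tokens.
    layer = ReferenceInjectedSelfAttention(dim=64)
    x = torch.randn(1, 1024, 64)
    ref = torch.randn(1, 1024, 64)
    print(layer(x, ref).shape)   # torch.Size([1, 1024, 64])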