Recent advancements in text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques (e.g., DreamBooth and LoRA) enable individuals to generate high-quality, imaginative images. However, these models often struggle when generating images at resolutions outside their training domain. To overcome this limitation, we present the Resolution Adapter (ResAdapter), a domain-consistent adapter designed for diffusion models to generate images with unrestricted resolutions and aspect ratios. Unlike other multi-resolution generation methods that process images of static resolution with complex post-processing operations, ResAdapter directly generates images at dynamic resolutions. Notably, after learning a deep understanding of pure resolution priors, ResAdapter, trained on a general dataset, generates resolution-free images with personalized diffusion models while preserving their original style domain. Comprehensive experiments demonstrate that ResAdapter, with only 0.5M parameters, can process images at flexible resolutions for arbitrary diffusion models. Further experiments demonstrate that ResAdapter is compatible with other modules (e.g., ControlNet, IP-Adapter and LCM-LoRA) for image generation across a broad range of resolutions, and can be integrated into other multi-resolution models (e.g., ElasticDiffusion) to efficiently generate higher-resolution images. Project link is this https URL
https://arxiv.org/abs/2403.02084
Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands.
https://arxiv.org/abs/2403.01693
Class-conditional image generation based on diffusion models is renowned for producing high-quality and diverse images. However, most prior efforts focus on generating images for general categories, e.g., the 1000 classes of ImageNet-1k. A more challenging task, large-scale fine-grained image generation, remains largely unexplored. In this work, we present a parameter-efficient strategy, called FineDiffusion, to fine-tune large pre-trained diffusion models, scaling them to large-scale fine-grained image generation with 10,000 categories. FineDiffusion significantly accelerates training and reduces storage overhead by fine-tuning only the tiered class embedder, bias terms, and normalization layers' parameters. To further improve the image generation quality for fine-grained categories, we propose a novel sampling method that utilizes superclass-conditioned guidance, specifically tailored to fine-grained categories, to replace conventional classifier-free guidance sampling. Compared to full fine-tuning, FineDiffusion achieves a remarkable 1.56x training speed-up and requires storing merely 1.77% of the total model parameters, while achieving a state-of-the-art FID of 9.776 on 10,000-class image generation. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over other parameter-efficient fine-tuning methods. The code and more generated results are available at our project website: this https URL.
https://arxiv.org/abs/2402.18331
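The core of FineDiffusion's efficiency is freezing almost everything and fine-tuning only the class embedder, bias terms, and normalization parameters. The sketch below illustrates that selection rule with a name-pattern filter; the parameter names and counts are invented for illustration, not the paper's actual module names.

```python
# Hypothetical sketch of FineDiffusion-style parameter selection: only
# class-embedder, bias, and normalization parameters stay trainable.
# Name patterns and sizes below are assumptions for illustration.

TRAINABLE_PATTERNS = ("class_embed", ".bias", "norm")

def select_trainable(param_names):
    """Return the subset of parameter names that would be fine-tuned."""
    return [n for n in param_names if any(p in n for p in TRAINABLE_PATTERNS)]

def trainable_fraction(param_sizes):
    """Fraction of parameters kept trainable, given a {name: count} dict."""
    total = sum(param_sizes.values())
    kept = sum(c for n, c in param_sizes.items()
               if any(p in n for p in TRAINABLE_PATTERNS))
    return kept / total

params = {
    "blocks.0.attn.qkv.weight": 1_000_000,  # frozen: large attention weight
    "blocks.0.attn.qkv.bias": 1_000,        # trainable: bias term
    "blocks.0.norm1.weight": 1_000,         # trainable: normalization layer
    "class_embed.weight": 10_000,           # trainable: class embedder
}
print(select_trainable(params))
print(trainable_fraction(params))  # small fraction, echoing the 1.77% figure
```

With these toy sizes only about 1% of parameters remain trainable, which is the storage-saving mechanism the abstract describes.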
Precision devices play an important role in enhancing production quality and productivity in agricultural systems. Therefore, the optimization of these devices is essential in precision agriculture. Recently, with the advancements of deep learning, there have been several studies aiming to harness its capabilities for improving spray system performance. However, the effectiveness of these methods heavily depends on the size of the training dataset, which is expensive and time-consuming to collect. To address the challenge of insufficient training samples, this paper proposes an alternative solution by generating artificial images of droplets using generative adversarial networks (GAN). The GAN model is trained on a small dataset captured by a high-speed camera and is capable of generating images with progressively increasing resolution. The results demonstrate that the model can generate high-quality images at a size of $1024\times1024$. Furthermore, this research leverages recent advancements in computer vision and deep learning to develop a lightweight droplet detector using the synthetic dataset. As a result, the detection model achieves a 16.06\% increase in mean average precision (mAP) when utilizing the synthetic dataset. To the best of our knowledge, this work is the first to employ a generative model for augmenting droplet detection. Its significance lies not only in optimizing nozzle design for constructing efficient spray systems but also in addressing the common challenge of insufficient data in various precision agriculture tasks. This work offers a critical contribution to conserving resources while striving for optimal and sustainable agricultural practices.
https://arxiv.org/abs/2402.15909
Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net, the workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5x faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. User studies showed that our model was favored by a large margin over the most recent methods. See our website at this https URL.
https://arxiv.org/abs/2402.14797
Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at this https URL.
https://arxiv.org/abs/2402.12376
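FiT's central idea, treating an image as a sequence of dynamically-sized tokens rather than a fixed grid, can be made concrete with a little arithmetic: any resolution maps to a token count, and arbitrary aspect ratios fit the same sequence budget without cropping. The patch size and budget below are assumed values, not FiT's actual configuration.

```python
# Illustrative sketch of "images as dynamically-sized token sequences".
# PATCH and MAX_TOKENS are assumptions, not the paper's settings.

PATCH = 16         # assumed patch size
MAX_TOKENS = 1024  # assumed sequence-length budget

def num_tokens(height, width, patch=PATCH):
    """Token count for an image, padding each side up to a patch multiple."""
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    return rows * cols

def fits_budget(height, width, patch=PATCH, budget=MAX_TOKENS):
    """Whether this resolution fits the sequence budget without cropping."""
    return num_tokens(height, width, patch) <= budget

# A 1:2 image and a 2:1 image consume the same number of tokens:
print(num_tokens(256, 512), num_tokens(512, 256))  # 512 512
print(fits_budget(512, 512))  # True: exactly 1024 tokens
```

The point of the sketch is that the sequence view is aspect-ratio agnostic: training batches can mix resolutions freely as long as each image's token count fits the budget.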
Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. However, these models are vulnerable to unsafe inputs that elicit unsafe content such as sexual, harassment, and illegal-activity images. Existing approaches based on image checkers, model fine-tuning, and embedding blocking are impractical in real-world applications. Hence, \textit{we propose the first universal prompt optimizer for safe T2I generation in the black-box scenario}. We first construct a dataset of toxic-clean prompt pairs using GPT-3.5 Turbo. To guide the optimizer to convert toxic prompts into clean prompts while preserving semantic information, we design a novel reward function measuring the toxicity and text alignment of generated images, and train the optimizer through Proximal Policy Optimization. Experiments show that our approach can effectively reduce the likelihood of various T2I models generating inappropriate images, with no significant impact on text alignment. It is also flexible enough to be combined with other methods to achieve better performance.
https://arxiv.org/abs/2402.10882
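The reward described above trades off two signals: penalize image toxicity, preserve text alignment. A minimal sketch of that shape follows; the linear form and the weights are assumptions for illustration, not the authors' actual reward.

```python
# Minimal sketch of a toxicity-vs-alignment reward for prompt optimization.
# The linear combination and weights are assumptions, not the paper's design.

def prompt_reward(toxicity, alignment, tox_weight=1.0, align_weight=0.5):
    """Higher reward for clean, semantically faithful generations.

    toxicity:  probability in [0, 1] that the generated image is unsafe
    alignment: CLIP-style image-text similarity in [0, 1]
    """
    return -tox_weight * toxicity + align_weight * alignment

# A clean, well-aligned generation outranks an equally aligned toxic one:
clean = prompt_reward(toxicity=0.05, alignment=0.8)
toxic = prompt_reward(toxicity=0.9, alignment=0.8)
print(clean > toxic)  # True
```

In the actual pipeline this scalar would be the PPO reward for each rewritten prompt, so the optimizer is pushed toward clean rewrites only insofar as they keep the generated image on-prompt.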
Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
https://arxiv.org/abs/2402.10491
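The cascade idea above, reusing a low-resolution base model and chaining upsampler modules toward the target resolution, can be sketched as a simple resolution schedule. The 2x-per-stage factor is an assumption for illustration; the paper's actual scale factors may differ.

```python
# Sketch of a self-cascade resolution schedule: start from the base model's
# trained resolution and chain upsampler stages until the target is reached.
# The per-stage factor of 2 is an assumed value.

def cascade_schedule(base, target, factor=2):
    """Resolutions visited when adapting a base model to a higher target."""
    sizes = [base]
    while sizes[-1] < target:
        sizes.append(min(sizes[-1] * factor, target))
    return sizes

print(cascade_schedule(512, 2048))  # [512, 1024, 2048]
```

Each step after the first corresponds to one lightweight upsampler module, which is why only ~0.002M extra parameters need tuning rather than the whole backbone.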
Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet, these models are not robust in precisely reasoning about physical and spatial configurations of objects, especially when instructed with unconventional, thereby out-of-distribution descriptions, such as "a chair with five legs". In this paper, we propose a language agent with chain-of-3D-thoughts (L3GO), an inference-time approach that can reason about part-based 3D mesh generation of unconventional objects that current data-driven diffusion models struggle with. More concretely, we use large language models as agents to compose a desired object via trial-and-error within the 3D simulation environment. To facilitate our investigation, we develop a new benchmark, Unconventionally Feasible Objects (UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender where language agents can build and compose atomic building blocks via API calls. Human and automatic GPT-4V evaluations show that our approach surpasses the standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our approach outperforms other state-of-the-art text-to-2D image and text-to-3D models based on human evaluation.
https://arxiv.org/abs/2402.09052
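The agent loop L3GO runs, propose a part, test it in the simulator, retry on failure, is a generic trial-and-error pattern. The sketch below abstracts it with stand-in callables: `propose` would be an LLM call and `is_feasible` a Blender/SimpleBlenv check; both are hypothetical here.

```python
# Generic sketch of an L3GO-style trial-and-error build loop in a simulator.
# `propose` and `is_feasible` stand in for LLM calls and physics checks.

def build_object(parts, propose, is_feasible, max_retries=3):
    """Assemble parts one by one, retrying each proposal a few times."""
    placed = []
    for part in parts:
        for attempt in range(max_retries):
            candidate = propose(part, attempt)
            if is_feasible(candidate, placed):
                placed.append(candidate)
                break
        else:  # no feasible placement found within the retry budget
            raise RuntimeError(f"could not place {part!r}")
    return placed

# Toy run: leg placements fail on attempt 0 but succeed when retried.
def propose(part, attempt):
    return (part, attempt)

def is_feasible(candidate, placed):
    return candidate[1] > 0 or candidate[0] == "seat"

print(build_object(["seat", "leg1", "leg2"], propose, is_feasible))
```

The retry budget is the interesting knob: it bounds how much simulator feedback the agent may consume per part, which is where unconventional prompts like "a chair with five legs" earn their extra attempts.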
Guided image synthesis methods, like SDEdit based on the diffusion model, excel at creating realistic images from user inputs such as stroke paintings. However, existing efforts mainly focus on image quality, often overlooking a key point: the diffusion model represents a data distribution, not individual images. This introduces a low but critical chance of generating images that contradict user intentions, raising ethical concerns. For example, a user inputting a stroke painting with female characteristics might, with some probability, get male faces from SDEdit. To expose this potential vulnerability, we aim to build an adversarial attack forcing SDEdit to generate a specific data distribution aligned with a specified attribute (e.g., female), without changing the input's attribute characteristics. We propose the Targeted Attribute Generative Attack (TAGA), using an attribute-aware objective function and optimizing the adversarial noise added to the input stroke painting. Empirical studies reveal that traditional adversarial noise struggles with TAGA, while natural perturbations like exposure and motion blur easily alter generated images' attributes. To execute effective attacks, we introduce FoolSDEdit: we design a joint adversarial exposure and blur attack, adding exposure and motion blur to the stroke painting and optimizing them together. We optimize the execution strategy of the various perturbations, framing it as a network architecture search problem, and create SuperPert, a graph representing diverse execution strategies for the different perturbations. After training, we obtain the optimized execution strategy for effective TAGA against SDEdit. Comprehensive experiments on two datasets show that our method compels SDEdit to generate a targeted attribute-aware data distribution, significantly outperforming baselines.
https://arxiv.org/abs/2402.03705
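The two natural perturbations FoolSDEdit optimizes jointly, exposure and motion blur, are both simple image operations. The sketch below applies them to a 1-D "scanline" for brevity; real inputs would be 2-D images, and the gain and kernel size would be the quantities the attack optimizes rather than fixed constants.

```python
# Minimal sketch of an exposure gain plus a box blur (a crude stand-in for
# motion blur), the two perturbations the attack optimizes jointly.
# Applied here to a 1-D scanline for brevity; real inputs are 2-D images.

def apply_exposure(pixels, gain):
    """Scale intensities and clip to [0, 1], like an exposure change."""
    return [min(1.0, max(0.0, p * gain)) for p in pixels]

def motion_blur(pixels, kernel_size=3):
    """Box blur along one axis, averaging each pixel with its neighbors."""
    half = kernel_size // 2
    out = []
    for i in range(len(pixels)):
        window = pixels[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

scan = [0.2, 0.4, 0.9, 0.1]
print(motion_blur(apply_exposure(scan, gain=1.5)))
```

Because both operations are differentiable almost everywhere (clipping aside), their parameters can be optimized with gradients against the attribute-aware objective, which is what makes them usable inside TAGA.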
Artistic video portrait generation is a significant and sought-after task in the fields of computer graphics and vision. While various methods have been developed that integrate NeRFs or StyleGANs with instructional editing models for creating and editing drivable portraits, these approaches face several challenges. They often rely heavily on large datasets, require extensive customization processes, and frequently result in reduced image quality. To address the above problems, we propose the Efficient Monotonic Video Style Avatar (Emo-Avatar), which uses deferred neural rendering to enhance StyleGAN's capacity for producing dynamic, drivable portrait videos. We propose a two-stage deferred neural rendering pipeline. In the first stage, we use few-shot PTI initialization to initialize the StyleGAN generator on several extreme poses sampled from the video, capturing a consistent representation of aligned faces from the target portrait. In the second stage, we sample high-frequency textures via a Laplacian pyramid from UV maps deformed by the dynamic flow of expression; integrating this motion-aware texture prior provides torso features and enhances StyleGAN's ability to generate the complete upper body for portrait video rendering. Emo-Avatar reduces style customization time from hours to merely 5 minutes compared with existing methods. In addition, Emo-Avatar requires only a single reference image for editing and employs region-aware contrastive learning with semantically invariant CLIP guidance, ensuring consistent high-resolution output and identity preservation. Through both quantitative and qualitative assessments, Emo-Avatar demonstrates superior performance over existing methods in terms of training efficiency, rendering quality and editability in self- and cross-reenactment.
https://arxiv.org/abs/2402.00827
The multifaceted nature of human perception and comprehension indicates that, when we think, our body can naturally take any combination of senses, a.k.a., modalities, and form a coherent picture in our brain. For example, when we see a cattery and simultaneously perceive the cat's purring sound, our brain can construct a picture of a cat in the cattery. Intuitively, generative AI models should match this versatility and be capable of generating images from any combination of modalities efficiently and collaboratively. This paper presents ImgAny, a novel end-to-end multi-modal generative model that can mimic human reasoning and generate high-quality images. Our method is, to our knowledge, the first capable of efficiently and flexibly taking any combination of seven modalities, ranging from language and audio to vision modalities, including image, point cloud, thermal, depth, and event data. Our key idea is inspired by human-level cognitive processes and involves the integration and harmonization of multiple input modalities at both the entity and attribute levels without specific tuning across modalities. Accordingly, our method brings two novel training-free technical branches: 1) the Entity Fusion Branch ensures the coherence between inputs and outputs. It extracts entity features from the multi-modal representations powered by our specially constructed entity knowledge graph; 2) the Attribute Fusion Branch adeptly preserves and processes the attributes. It efficiently amalgamates distinct attributes from diverse input modalities via our proposed attribute knowledge graph. Lastly, the entity and attribute features are adaptively fused as the conditional inputs to the pre-trained Stable Diffusion model for image generation. Extensive experiments under diverse modality combinations demonstrate its exceptional capability for visual content creation.
https://arxiv.org/abs/2401.17664
In contemporary design practices, the integration of computer vision and generative artificial intelligence (genAI) represents a transformative shift towards more interactive and inclusive processes. These technologies offer new dimensions of image analysis and generation, which are particularly relevant in the context of urban landscape reconstruction. This paper presents a novel workflow encapsulated within a prototype application, designed to leverage the synergies between advanced image segmentation and diffusion models for a comprehensive approach to urban design. Our methodology encompasses the OneFormer model for detailed image segmentation and the Stable Diffusion XL (SDXL) diffusion model, implemented through ControlNet, for generating images from textual descriptions. Validation results indicated a high degree of performance by the prototype application, showcasing significant accuracy in both object detection and text-to-image generation. This was evidenced by superior Intersection over Union (IoU) and CLIP scores across iterative evaluations for various categories of urban landscape features. Preliminary testing included utilising UrbanGenAI as an educational tool enhancing the learning experience in design pedagogy, and as a participatory instrument facilitating community-driven urban planning. Early results suggested that UrbanGenAI not only advances the technical frontiers of urban landscape reconstruction but also provides significant pedagogical and participatory planning benefits. The ongoing development of UrbanGenAI aims to further validate its effectiveness across broader contexts and integrate additional features such as real-time feedback mechanisms and 3D modelling capabilities. Keywords: generative AI; panoptic image segmentation; diffusion models; urban landscape design; design pedagogy; co-design
https://arxiv.org/abs/2401.14379
Advancements in generative models have sparked significant interest in generating images while adhering to specific structural guidelines. Scene graph to image generation is one such task of generating images which are consistent with a given scene graph. However, the complexity of visual scenes poses a challenge in accurately aligning objects based on the specified relations within the scene graph. Existing methods approach this task by first predicting a scene layout and then generating images from these layouts using adversarial training. In this work, we introduce a novel approach to generate images from scene graphs which eliminates the need for predicting intermediate layouts. We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images. Towards this, we first pre-train our graph encoder to align graph features with CLIP features of corresponding images using GAN-based training. Further, we fuse the graph features with the CLIP embeddings of the object labels present in the given scene graph to create a graph-consistent, CLIP-guided conditioning signal. In the conditioning input, object embeddings provide the coarse structure of the image and graph features provide structural alignment based on relationships among objects. Finally, we fine-tune a pre-trained diffusion model with the graph-consistent conditioning signal using reconstruction and CLIP alignment losses. Extensive experiments reveal that our method outperforms existing methods on the standard benchmarks of the COCO-stuff and Visual Genome datasets.
https://arxiv.org/abs/2401.14111
Recent text-to-image generation models have demonstrated incredible success in generating images that faithfully follow input prompts. However, the requirement of using words to describe a desired concept provides limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach to enable personalization capabilities in existing text-to-image diffusion models. We propose a novel architecture (BootPIG) that allows a user to provide reference images of an object in order to guide the appearance of a concept in the generated images. The proposed BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model and utilizes a separate UNet model to steer the generations toward the desired appearance. We introduce a training procedure that allows us to bootstrap personalization capabilities in the BootPIG architecture using data generated from pretrained text-to-image models, LLM chat agents, and image segmentation models. In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while being comparable with test-time finetuning approaches. Through a user study, we validate the preference for BootPIG generations over existing methods both in maintaining fidelity to the reference object's appearance and aligning with textual prompts.
https://arxiv.org/abs/2401.13974
Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents \textbf{UNIMO-G}, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs, which demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: firstly pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multi-modal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.
https://arxiv.org/abs/2401.13388
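The UNIMO-G framework above has two core components: an MLLM that encodes an interleaved text/image prompt, and a conditional denoising network that generates from that encoding. A minimal numpy sketch of this pipeline shape follows; the encoders and the denoiser are toy stand-ins, and every function name is illustrative, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_multimodal_prompt(segments, dim=16):
    """Toy stand-in for the MLLM: map each interleaved (kind, payload)
    segment to a fixed-size embedding and stack them."""
    embs = []
    for kind, payload in segments:
        if kind == "text":
            v = np.zeros(dim)  # bag-of-words style hashing of tokens
            for tok in payload.split():
                v[hash(tok) % dim] += 1.0
        else:  # "image": payload is an array; pool it into `dim` values
            flat = np.asarray(payload, dtype=float).ravel()
            v = np.resize(flat, dim)
        embs.append(v)
    return np.stack(embs)  # (num_segments, dim)

def denoise_step(x, cond, t):
    """Toy conditional denoiser: nudge the noisy latent toward the
    mean conditioning vector, more strongly at low noise levels t."""
    target = cond.mean(axis=0)
    return x + (1.0 - t) * (target - x)

def generate(segments, steps=10, latent_dim=16):
    """Encode the multimodal prompt once, then run the denoising loop."""
    cond = encode_multimodal_prompt(segments, dim=latent_dim)
    x = rng.standard_normal(latent_dim)
    for i in range(steps, 0, -1):
        x = denoise_step(x, cond, t=i / steps)
    return x
```

The point of the sketch is the separation of concerns: the conditioning is computed once from the mixed prompt, and the iterative sampler only ever sees the encoded representation.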
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video. The existing methods fail to simultaneously achieve the goals of accurate 3D avatar reconstruction and stable talking face animation. Besides, while the existing works mainly focus on synthesizing the head part, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Portrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and switchable background using a head-torso-background super-resolution model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos compared to previous methods.
https://arxiv.org/abs/2401.08503
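The four Real3D-Portrait components compose into one pipeline: image-to-plane reconstruction, audio-to-motion, motion-conditioned animation, and head-torso-background super-resolution. A toy numpy sketch of that composition, with placeholder math inside each stage and all names invented for illustration:

```python
import numpy as np

def image_to_plane(image):
    """Stage 1 (toy): 'reconstruct' a 3D plane representation from one image."""
    return {"planes": np.resize(np.asarray(image, dtype=float).ravel(), (3, 8))}

def audio_to_motion(audio):
    """Stage 4 in the paper's list (toy): map audio features to a motion vector."""
    return np.resize(np.asarray(audio, dtype=float), 8)

def motion_adapter(avatar, motion):
    """Stage 2 (toy): condition the avatar planes on the motion vector."""
    return avatar["planes"] + motion[None, :]

def head_torso_bg_sr(frame, background):
    """Stage 3 (toy): composite with a switchable background and 'upsample' 2x."""
    composite = frame.mean() + np.asarray(background, dtype=float)
    return np.repeat(composite, 2)

def talking_portrait(image, audio, background):
    """End-to-end: one source image + driving audio + chosen background."""
    avatar = image_to_plane(image)
    motion = audio_to_motion(audio)
    animated = motion_adapter(avatar, motion)
    return head_torso_bg_sr(animated, background)
```

Swapping the `background` argument models the "switchable background" property: the animation stages never see it, so it can change per frame without re-reconstructing the avatar.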
In recent years, the recognition of free-hand sketches has remained a popular task. However, in some special fields such as the military field, free-hand sketches are difficult to sample on a large scale. Common data augmentation and image generation techniques are difficult to produce images with various free-hand sketching styles. Therefore, the recognition and segmentation tasks in related fields are limited. In this paper, we propose a novel adversarial generative network that can accurately generate realistic free-hand sketches with various styles. We explore the performance of the model, including using styles randomly sampled from a prior normal distribution to generate images with various free-hand sketching styles, disentangling the painters' styles from known free-hand sketches to generate images with specific styles, and generating images of unknown classes that are not in the training set. We further demonstrate with qualitative and quantitative evaluations our advantages in visual quality, content accuracy, and style imitation on SketchIME.
https://arxiv.org/abs/2401.04739
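The abstract above describes two ways of obtaining a style code: sampling from a prior normal distribution for novel styles, and disentangling a painter's style from a known reference sketch. A minimal numpy sketch of that interface, with a toy generator; everything here is an assumed illustration, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(42)
STYLE_DIM = 8

def sample_style():
    """Draw a style code from the prior N(0, I) to get a novel style."""
    return rng.standard_normal(STYLE_DIM)

def encode_style(reference_sketch):
    """Toy style encoder: pool a reference sketch into a style code,
    standing in for disentangling a painter's style from a known sketch."""
    flat = np.asarray(reference_sketch, dtype=float).ravel()
    return np.resize(flat, STYLE_DIM)

def generate_sketch(content_label, style_code, size=16):
    """Toy generator: a content-dependent base pattern modulated by style,
    so the same content can be rendered in many styles."""
    base = np.sin(np.arange(size) * (content_label + 1))
    modulation = np.resize(style_code, size)
    return base + 0.1 * modulation
```

The key property the sketch preserves is that content and style enter through separate arguments: fixing `content_label` and varying `style_code` yields different renderings of the same class.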
Text-to-image generation is conducted through Generative Adversarial Networks (GANs) or transformer models. However, the current challenge lies in accurately generating images based on textual descriptions, especially in scenarios where the content and theme of the target image are ambiguous. In this paper, we propose a method that utilizes artificial intelligence models for thematic creativity, followed by a classification modeling of the actual painting process. The method involves converting all visual elements into quantifiable data structures before creating images. We evaluate the effectiveness of this approach in terms of semantic accuracy, image reproducibility, and computational efficiency, in comparison with existing image generation algorithms.
https://arxiv.org/abs/2401.04116
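The central move in the method above is converting all visual elements into quantifiable data structures before any image is painted. One plausible shape for such a structure, sketched as a Python dataclass; the schema and the toy theme-to-plan step are assumptions for illustration, not taken from the paper:

```python
from dataclasses import dataclass, asdict

@dataclass
class VisualElement:
    """One quantifiable visual element planned before painting."""
    kind: str        # e.g. "shape" or "stroke"
    position: tuple  # normalized (x, y) in [0, 1]
    size: float      # relative to the canvas
    color: tuple     # RGB in [0, 255]
    layer: int       # painting order

def plan_painting(theme: str) -> list:
    """Toy 'thematic creativity' step: map a theme string to a
    deterministic element plan that a painting stage could render."""
    elements = []
    for i, word in enumerate(theme.split()):
        elements.append(VisualElement(
            kind="shape",
            position=((i * 0.1) % 1.0, (len(word) * 0.05) % 1.0),
            size=0.1 + 0.01 * len(word),
            color=(hash(word) % 256, 128, 64),
            layer=i,
        ))
    return elements
```

Because every element is plain data, the plan can be inspected, scored for semantic accuracy, or serialized (`asdict`) before the computationally heavier rendering step runs.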
Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task as it allows control over both the content and the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs are more efficient as they only need a single feed-forward pass for generation, but the image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling and using cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
https://arxiv.org/abs/2312.13314
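A distinctive detail in the DP-SIMS generator described above is injecting noise into latent variables via cross-attention rather than plain addition: latent positions attend over a set of noise tokens. A minimal numpy sketch of that mechanism, with identity projection matrices for clarity (a real model would learn them); this is an assumed illustration, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_noise(latent, noise_tokens, d=8):
    """Latent positions (queries) attend over noise tokens (keys/values),
    so noise enters the latent through an attention readout."""
    Wq = np.eye(d)  # identity projections for clarity; learned in practice
    Wk = np.eye(d)
    Wv = np.eye(d)
    Q = latent @ Wq        # (L, d)
    K = noise_tokens @ Wk  # (N, d)
    V = noise_tokens @ Wv  # (N, d)
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (L, N), rows sum to 1
    return latent + attn @ V
```

Compared with adding i.i.d. noise per position, the attention readout lets each latent position take a different, content-dependent mixture of the same noise tokens, which is one plausible reading of why this yields more diverse outputs.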