Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences -- locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (20.9% relative for the SPair-71k dataset) any existing weakly or unsupervised method on PF-Willow, CUB-200 and SPair-71k datasets.
https://arxiv.org/abs/2305.15581
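To make the core mechanism concrete, here is a minimal PyTorch sketch of the idea of optimizing a prompt embedding so that its cross-attention concentrates on a region of interest. The toy attention layer, feature shapes, learning rate, and step count are illustrative assumptions standing in for the frozen diffusion model's internals, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, D, H, W = 32, 64, 16, 16                  # feature dim, attention dim, spatial size (assumptions)

# Frozen stand-ins for the diffusion model's spatial features and attention projections.
img_feats = torch.randn(H * W, C)            # features of the given image
W_q, W_k = torch.randn(C, D), torch.randn(C, D)

def attention_over_image(txt_emb):
    """Attention weights of one text token over all spatial locations, shape (H*W,)."""
    q = img_feats @ W_q                      # (H*W, D) image queries
    k = txt_emb @ W_k                        # (1, D)   text key
    return F.softmax((q @ k.t()).squeeze(-1) / D ** 0.5, dim=0)

# Region of interest: a binary mask over the H x W grid (here: top-left quadrant).
mask = torch.zeros(H, W); mask[:H // 2, :W // 2] = 1.0
mask = mask.flatten()

# Optimize a free prompt embedding so its attention mass falls inside the mask.
txt_emb = torch.randn(1, C, requires_grad=True)
opt = torch.optim.Adam([txt_emb], lr=1e-2)
for step in range(200):
    attn = attention_over_image(txt_emb)
    loss = -(attn * mask).sum()              # maximize attention inside the region
    opt.zero_grad(); loss.backward(); opt.step()

# The optimized embedding can then be queried against another image's features to
# locate the semantically corresponding region (argmax of its attention map there).
```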
Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.
https://arxiv.org/abs/2305.13655
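A small sketch of the first stage described above: building an in-context prompt that asks an LLM for a bounding-box layout, and parsing its reply. The exemplar format, the 512x512 canvas, and the parse_layout helper are hypothetical conveniences; the actual LLM call is stubbed with a hand-written reply.

```python
import re

def layout_prompt(user_prompt: str) -> str:
    """In-context prompt asking an LLM for a layout; the exemplar format is an assumption."""
    return (
        "Generate a layout as lines of 'description: [x, y, w, h]' on a 512x512 canvas.\n"
        "Prompt: a cat to the left of a dog\n"
        "a cat: [40, 200, 180, 200]\n"
        "a dog: [300, 210, 180, 200]\n"
        f"Prompt: {user_prompt}\n"
    )

def parse_layout(llm_reply: str):
    """Parse 'description: [x, y, w, h]' lines into (text, box) pairs."""
    boxes = []
    for line in llm_reply.splitlines():
        m = re.match(r"\s*(.+?):\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", line)
        if m:
            boxes.append((m.group(1), tuple(int(v) for v in m.groups()[1:])))
    return boxes

# Hand-written reply standing in for the LLM call; the boxes then condition the diffusion stage.
reply = "a red ball: [60, 300, 120, 120]\na wooden table: [30, 360, 450, 120]"
print(parse_layout(reply))
```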
Recent text-to-image generation models have demonstrated an impressive capability of generating text-aligned images with high fidelity. However, generating images of a novel concept provided by a user input image is still a challenging task. To address this problem, researchers have been exploring various methods for customizing pre-trained text-to-image generation models. Currently, most existing methods for customizing pre-trained text-to-image generation models involve the use of regularization techniques to prevent over-fitting. While regularization eases the challenge of customization and leads to successful content creation with respect to text guidance, it may restrict the model capability, resulting in the loss of detailed information and inferior performance. In this work, we propose a novel framework for customized text-to-image generation without the use of regularization. Specifically, our proposed framework consists of an encoder network and a novel sampling method that can tackle the over-fitting problem without the use of regularization. With the proposed framework, we are able to customize a large-scale text-to-image generation model within half a minute on a single GPU, with only one image provided by the user. We demonstrate in experiments that our proposed framework outperforms existing methods and preserves more fine-grained details.
https://arxiv.org/abs/2305.13579
Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.
https://arxiv.org/abs/2305.11147
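The abstract mentions a task-aware HyperNet that modulates the diffusion model for different C2I tasks. The sketch below illustrates one plausible reading of that idea: a small network that maps a task id to FiLM-style per-channel scale/shift applied to a feature map. The module name, sizes, and the FiLM-style modulation itself are assumptions, not UniControl's actual architecture.

```python
import torch
import torch.nn as nn

class TaskAwareHyperNet(nn.Module):
    """Maps a task id to per-channel scale/shift used to modulate a feature map.
    FiLM-style modulation is an illustrative simplification of a task-aware HyperNet."""
    def __init__(self, num_tasks: int, channels: int, hidden: int = 128):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, hidden)
        self.to_scale_shift = nn.Linear(hidden, 2 * channels)

    def forward(self, feats: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); task_id: (B,)
        scale, shift = self.to_scale_shift(self.task_emb(task_id)).chunk(2, dim=-1)
        return feats * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

# Usage: modulate a control branch's features for task 3 out of nine C2I tasks.
hyper = TaskAwareHyperNet(num_tasks=9, channels=64)
feats = torch.randn(2, 64, 32, 32)
out = hyper(feats, torch.tensor([3, 3]))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```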
Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce \textbf{TextDiffuser}, focusing on generating images with visually appealing text that is coherent with the background. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text-image dataset with OCR annotations, \textbf{MARIO-10M}, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the \textbf{MARIO-Eval} benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable in creating high-quality text images using text prompts alone or together with text template images, and can conduct text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at \url{this https URL}.
https://arxiv.org/abs/2305.10855
This work addresses how to validate group fairness in image recognition software. We propose a distribution-aware fairness testing approach (called DistroFair) that systematically exposes class-level fairness violations in image classifiers via a synergistic combination of out-of-distribution (OOD) testing and semantic-preserving image mutation. DistroFair automatically learns the distribution (e.g., number/orientation) of objects in a set of images. Then it systematically mutates objects in the images to become OOD using three semantic-preserving image mutations: object deletion, object insertion, and object rotation. We evaluate DistroFair using two well-known datasets (CityScapes and MS-COCO) and three major commercial image recognition software systems (namely, Amazon Rekognition, Google Cloud Vision, and Azure Computer Vision). Results show that about 21% of images generated by DistroFair reveal class-level fairness violations using either ground-truth or metamorphic oracles. DistroFair is up to 2.3x more effective than two main baselines, i.e., (a) an approach which focuses on generating images only within the distribution (ID) and (b) fairness analysis using only the original image dataset. We further observed that DistroFair is efficient: it generates 460 images per hour on average. Finally, we evaluate the semantic validity of our approach via a user study with 81 participants, using 30 real images and 30 corresponding mutated images generated by DistroFair. We found that images generated by DistroFair are 80% as realistic as real-world images.
https://arxiv.org/abs/2305.13935
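As a rough illustration of the semantic-preserving mutations named above (object deletion, insertion, rotation), here is a toy NumPy sketch of deletion and rotation given an object bounding box. Real DistroFair works from learned object distributions and detector/segmentation output; the median-fill deletion and 90-degree rotation below are simplifying assumptions.

```python
import numpy as np

def delete_object(img: np.ndarray, bbox) -> np.ndarray:
    """Remove the object inside bbox by filling it with the image's median color
    (a crude stand-in for the inpainting a real mutation engine would use)."""
    x0, y0, x1, y1 = bbox
    out = img.copy()
    out[y0:y1, x0:x1] = np.median(img.reshape(-1, img.shape[-1]), axis=0)
    return out

def rotate_object(img: np.ndarray, bbox, k: int = 1) -> np.ndarray:
    """Rotate the square crop inside bbox by k * 90 degrees in place."""
    x0, y0, x1, y1 = bbox
    out = img.copy()
    out[y0:y1, x0:x1] = np.rot90(img[y0:y1, x0:x1], k)
    return out

# Toy usage on a random "image" with one object box; a real pipeline would obtain
# boxes/masks from a detector and verify that the mutation keeps the label's semantics.
img = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
mutants = [delete_object(img, (32, 32, 64, 64)), rotate_object(img, (32, 32, 64, 64))]
```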
Although text-to-image diffusion models have made significant strides in generating images from text, they are sometimes more inclined to generate images like the data on which the model was trained rather than the provided text. This limitation has hindered their usage in both 2D and 3D applications. To address this problem, we explored the use of negative prompts but found that the current implementation fails to produce desired results, particularly when there is an overlap between the main and negative prompts. To overcome this issue, we propose Perp-Neg, a new algorithm that leverages the geometrical properties of the score space to address the shortcomings of the current negative prompts algorithm. Perp-Neg does not require any training or fine-tuning of the model. Moreover, we experimentally demonstrate that Perp-Neg provides greater flexibility in generating images by enabling users to edit out unwanted concepts from the initially generated images in 2D cases. Furthermore, to extend the application of Perp-Neg to 3D, we conducted a thorough exploration of how Perp-Neg can be used in 2D to condition the diffusion model to generate desired views, rather than being biased toward the canonical views. Finally, we applied our 2D intuition to integrate Perp-Neg with the state-of-the-art text-to-3D (DreamFusion) method, effectively addressing its Janus (multi-head) problem.
https://arxiv.org/abs/2304.04968
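A minimal sketch of the geometric operation commonly associated with Perp-Neg: projecting the negative-prompt guidance direction onto the component perpendicular to the positive-prompt direction before subtracting it. The guidance weights, tensor shapes, and the exact combination rule are assumptions; e_uncond, e_pos, and e_neg stand in for the UNet's noise predictions.

```python
import torch

def perp(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Component of x perpendicular to y, computed per batch element."""
    x_f, y_f = x.flatten(1), y.flatten(1)
    coeff = (x_f * y_f).sum(dim=1, keepdim=True) / (y_f * y_f).sum(dim=1, keepdim=True).clamp_min(1e-8)
    return x - coeff.view(-1, *[1] * (x.dim() - 1)) * y

def perp_neg_guidance(e_uncond, e_pos, e_neg, w_pos=7.5, w_neg=7.5):
    """Classifier-free guidance where the negative direction is first projected
    perpendicular to the positive one, so overlapping content is not cancelled."""
    d_pos = e_pos - e_uncond
    d_neg = e_neg - e_uncond
    return e_uncond + w_pos * d_pos - w_neg * perp(d_neg, d_pos)

# e_* would be the UNet's noise predictions for the unconditional, main, and negative prompts.
eps = [torch.randn(1, 4, 64, 64) for _ in range(3)]
noise_pred = perp_neg_guidance(*eps)
```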
Recent diffusion-based generators can produce high-quality images based only on textual prompts. However, they do not correctly interpret instructions that specify the spatial layout of the composition. We propose a simple approach that can achieve robust layout control without requiring training or fine-tuning the image generator. Our technique, which we call layout guidance, manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the reconstruction in the desired direction given, e.g., a user-specified layout. In order to determine how to best guide attention, we study the role of different attention maps when generating images and experiment with two alternative strategies, forward and backward guidance. We evaluate our method quantitatively and qualitatively with several experiments, validating its effectiveness. We further demonstrate its versatility by extending layout guidance to the task of editing the layout and context of a given real image.
https://arxiv.org/abs/2304.03373
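To give a flavor of backward guidance, the sketch below defines an energy that penalizes a token's cross-attention mass falling outside a user-specified box and backpropagates it to the latent. The toy attention map, the placeholder projection, and the step size are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def layout_energy(attn: torch.Tensor, box_mask: torch.Tensor) -> torch.Tensor:
    """Backward-guidance energy: penalize attention mass of a token that falls
    outside its target box. attn: (H, W) map for one token; box_mask: (H, W) in {0, 1}."""
    inside = (attn * box_mask).sum()
    return (1.0 - inside / attn.sum().clamp_min(1e-8)) ** 2

# Toy stand-ins: a latent that (through the real model) would produce the attention map.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
proj = torch.randn(4 * 64 * 64, 16 * 16)                     # placeholder for UNet internals
attn = torch.softmax(latent.flatten(1) @ proj, dim=-1).view(16, 16)
box_mask = torch.zeros(16, 16); box_mask[4:12, 4:12] = 1.0

loss = layout_energy(attn, box_mask)
loss.backward()
latent_guided = (latent - 20.0 * latent.grad).detach()       # step size is an assumption
```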
This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that becomes a critical piece in ensuring the object-specific embedding is faithfully reflected in the generation process while keeping control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity, without the need for test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work.
https://arxiv.org/abs/2304.02642
Deep generative models have the capacity to render high fidelity images of content like human faces. Recently, there has been substantial progress in conditionally generating images with specific quantitative attributes, like the emotion conveyed by one's face. These methods typically require a user to explicitly quantify the desired intensity of a visual attribute. A limitation of this method is that many attributes, like how "angry" a human face looks, are difficult for a user to precisely quantify. However, a user would be able to reliably say which of two faces seems "angrier". Following this premise, we develop the $\textit{PrefGen}$ system, which allows users to control the relative attributes of generated images by presenting them with simple paired comparison queries of the form "do you prefer image $a$ or image $b$?" Using information from a sequence of query responses, we can estimate user preferences over a set of image attributes and perform preference-guided image editing and generation. Furthermore, to make preference localization feasible and efficient, we apply an active query selection strategy. We demonstrate the success of this approach using a StyleGAN2 generator on the task of human face editing. Additionally, we demonstrate how our approach can be combined with CLIP, allowing a user to edit the relative intensity of attributes specified by text prompts. Code at this https URL.
https://arxiv.org/abs/2304.00185
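A small sketch of how preferences might be estimated from paired comparisons, assuming an ideal-point model with logistic response noise. The constant k, the 2-D attribute space, and the simulated user are all assumptions for illustration, not necessarily PrefGen's exact model.

```python
import torch

def nll_ideal_point(w, pairs, k=5.0):
    """Negative log-likelihood of an 'ideal point' w given paired comparisons.
    Each pair (a, b) means the user preferred attributes a over b; the logistic
    noise model and the constant k are modeling assumptions for this sketch."""
    loss = 0.0
    for a, b in pairs:
        logit = k * ((w - b).pow(2).sum() - (w - a).pow(2).sum())
        loss = loss - torch.nn.functional.logsigmoid(logit)
    return loss

# Attribute space: two relative attributes (e.g., "angrier", "older"), values in [0, 1].
true_pref = torch.tensor([0.8, 0.3])
pairs = []
for _ in range(40):                                   # simulated query responses
    a, b = torch.rand(2), torch.rand(2)
    if (true_pref - a).norm() > (true_pref - b).norm():
        a, b = b, a                                   # the user picks the closer option
    pairs.append((a, b))

w = torch.full((2,), 0.5, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for _ in range(300):
    opt.zero_grad(); loss = nll_ideal_point(w, pairs); loss.backward(); opt.step()
print(w.detach())   # should move toward true_pref; the estimate then conditions the generator
```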
Recent studies show strong generative performance in domain translation especially by using transfer learning techniques on the unconditional generator. However, the control between different domain features using a single model is still challenging. Existing methods often require additional models, which is computationally demanding and leads to unsatisfactory visual quality. In addition, they have restricted control steps, which prevents a smooth transition. In this paper, we propose a new approach for high-quality domain translation with better controllability. The key idea is to preserve source features within a disentangled subspace of a target feature space. This allows our method to smoothly control the degree to which it preserves source features while generating images from an entirely new domain using only a single model. Our extensive experiments show that the proposed method can produce more consistent and realistic images than previous works and maintain precise controllability over different levels of transformation. The code is available at this https URL.
https://arxiv.org/abs/2303.11545
Generating images with both photorealism and multiview 3D consistency is crucial for 3D-aware GANs, yet existing methods struggle to achieve them simultaneously. Improving the photorealism via CNN-based 2D super-resolution can break the strict 3D consistency, while keeping the 3D consistency by learning high-resolution 3D representations for direct rendering often compromises image quality. In this paper, we propose a novel learning strategy, namely 3D-to-2D imitation, which enables a 3D-aware GAN to generate high-quality images while maintaining their strict 3D consistency, by letting the images synthesized by the generator's 3D rendering branch to mimic those generated by its 2D super-resolution branch. We also introduce 3D-aware convolutions into the generator for better 3D representation learning, which further improves the image generation quality. With the above strategies, our method reaches FID scores of 5.4 and 4.3 on FFHQ and AFHQ-v2 Cats, respectively, at 512x512 resolution, largely outperforming existing 3D-aware GANs using direct 3D rendering and coming very close to the previous state-of-the-art method that leverages 2D super-resolution. Project website: this https URL.
https://arxiv.org/abs/2303.09036
Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME is successful in model editing, generalizes well for related prompts unseen during editing, and imposes minimal effect on unrelated generations.
https://arxiv.org/abs/2303.08084
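The abstract says TIME edits cross-attention projection matrices so that the source prompt is projected close to the destination prompt. The sketch below shows a ridge-regularized closed-form update of that general form; the regularization weight, the shapes, and the assumption that the targets come from projecting the destination tokens with the original matrix are illustrative, not necessarily the paper's exact objective.

```python
import torch

def edit_projection(W_old: torch.Tensor, C_src: torch.Tensor, C_dst: torch.Tensor,
                    lam: float = 0.1) -> torch.Tensor:
    """Closed-form ridge-style edit of a cross-attention projection matrix so that
    source-token embeddings map close to the destination tokens' original projections.
    W_old: (d_out, d_in); C_src, C_dst: (n_tokens, d_in). lam is an assumption."""
    V_dst = C_dst @ W_old.t()                                    # target values, (n, d_out)
    A = lam * torch.eye(W_old.shape[1]) + C_src.t() @ C_src      # (d_in, d_in)
    B = lam * W_old + V_dst.t() @ C_src                          # (d_out, d_in)
    return B @ torch.linalg.inv(A)                               # minimizer of the ridge objective

# Toy shapes standing in for CLIP text embeddings and one attention layer's key/value projection.
d_in, d_out, n = 32, 16, 5
W_new = edit_projection(torch.randn(d_out, d_in), torch.randn(n, d_in), torch.randn(n, d_in))
```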
Diffusion models that generate images conditioned on text, such as Dall-E 2 and Stable Diffusion, have recently made a splash far beyond the computer vision community. Here, we tackle the related problem of generating point clouds, both unconditionally and conditioned on images. For the latter, we introduce a novel geometrically-motivated conditioning scheme based on projecting sparse image features into the point cloud and attaching them to each individual point, at every step in the denoising process. This approach improves geometric consistency and yields greater fidelity than current methods relying on unstructured, global latent codes. Additionally, we show how to apply recent continuous-time diffusion schemes. Our method performs on par with or above the state of the art in conditional and unconditional experiments on synthetic data, while being faster, lighter, and delivering tractable likelihoods. We show it can also scale to diverse indoor scenes.
https://arxiv.org/abs/2303.05916
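A sketch of the conditioning idea described above: project each 3D point into the image with pinhole intrinsics, bilinearly sample a sparse feature map at the projection, and attach the sampled features to the point before feeding it to the denoiser. The camera intrinsics, tensor shapes, and sampling details are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def attach_image_features(points, feats, K):
    """Project 3D points with pinhole intrinsics K, bilinearly sample the image
    feature map at the projections, and concatenate the samples to each point.
    points: (P, 3) in camera coords, feats: (C, H, W), K: (3, 3)."""
    C, H, W = feats.shape
    uv = (K @ points.t()).t()                      # (P, 3) homogeneous pixel coords
    uv = uv[:, :2] / uv[:, 2:3].clamp_min(1e-6)    # (P, 2) pixel coordinates
    # Normalize to [-1, 1] for grid_sample (x = column, y = row).
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
    sampled = F.grid_sample(feats[None], grid, align_corners=True)  # (1, C, 1, P)
    sampled = sampled.view(C, -1).t()              # (P, C)
    return torch.cat([points, sampled], dim=-1)    # (P, 3 + C), fed to the denoiser

# Toy usage with made-up intrinsics and a small feature map (shapes are assumptions).
pts = torch.rand(1024, 3) + torch.tensor([0.0, 0.0, 2.0])     # points in front of the camera
K = torch.tensor([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
conditioned = attach_image_features(pts, torch.randn(8, 64, 64), K)
```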
Score-based diffusion models (SBDMs) have recently emerged as state-of-the-art approaches for image generation. Existing SBDMs are typically formulated in a finite-dimensional setting, where images are considered as tensors of a finite size. This paper develops SBDMs in the infinite-dimensional setting, that is, we model the training data as functions supported on a rectangular domain. Besides the quest for generating images at ever higher resolution, our primary motivation is to create a well-posed infinite-dimensional learning problem so that we can discretize it consistently on multiple resolution levels. We thereby hope to obtain diffusion models that generalize across different resolution levels and improve the efficiency of the training process. We demonstrate how to overcome two shortcomings of current SBDM approaches in the infinite-dimensional setting. First, we modify the forward process to ensure that the latent distribution is well-defined in the infinite-dimensional setting using the notion of trace-class operators. Second, we illustrate that approximating the score function with an operator network, in our case Fourier neural operators (FNOs), is beneficial for multilevel training. After deriving the forward and reverse processes in the infinite-dimensional setting, we show their well-posedness, derive adequate discretizations, and investigate the role of the latent distributions. We provide first promising numerical results on two datasets, MNIST and material structures. In particular, we show that multilevel training is feasible within this framework.
https://arxiv.org/abs/2303.04772
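For intuition, here is a schematic (and hedged) form of a forward process that stays well-posed on a Hilbert space: an Ornstein-Uhlenbeck SDE whose noise is colored by a trace-class covariance operator, so the limiting latent distribution is a genuine Gaussian measure. The specific drift and notation below are a standard textbook form and may differ in detail from the paper's construction.

```latex
% Schematic forward process on a Hilbert space H (the exact form is an assumption):
% an Ornstein-Uhlenbeck SDE whose noise is coloured by a trace-class covariance operator C.
\begin{align*}
  \mathrm{d}X_t &= -\tfrac{1}{2}\,X_t\,\mathrm{d}t + \sqrt{C}\,\mathrm{d}W_t,
      \qquad X_0 \sim p_{\mathrm{data}}, \\
  X_t \mid X_0 &\sim \mathcal{N}\!\bigl(e^{-t/2}X_0,\; (1-e^{-t})\,C\bigr),
      \qquad X_t \xrightarrow[t\to\infty]{} \mathcal{N}(0, C),
\end{align*}
% C must be trace class (tr C < infinity) for N(0, C) to be a well-defined Gaussian
% measure on H; this is the role of the trace-class condition mentioned in the abstract.
```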
ChatGPT is attracting cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, although showing great visual understanding and generation capabilities, are only experts on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called \textbf{Visual ChatGPT}, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only language but also images; 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps; and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at \url{this https URL}.
https://arxiv.org/abs/2303.04671
We introduce X&Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X&Fuse in three different text-to-image generation scenarios. (i) When a bank of images is available, we retrieve and condition on a related image (Retrieve&Fuse), resulting in significant improvements on the MS-COCO benchmark, gaining a state-of-the-art FID score of 6.65 in zero-shot settings. (ii) When cropped-object images are at hand, we utilize them and perform subject-driven generation (Crop&Fuse), outperforming the textual inversion method while being more than 100x faster. (iii) Having oracle access to the image scene (Scene&Fuse) allows us to achieve an FID score of 5.03 on MS-COCO in zero-shot settings. Our experiments indicate that X&Fuse is an effective, easy-to-adapt, simple, and general approach for scenarios in which the model may benefit from additional visual information.
https://arxiv.org/abs/2303.01000
Nowadays, the wide application of virtual digital humans promotes the comprehensive prosperity and development of a digital culture supported by the digital economy. Personalized portraits automatically generated by AI technology need both a natural artistic style and human sentiment. In this paper, we propose a novel StyleIdentityGAN model, which can ensure the identity and artistry of the generated portrait at the same time. Specifically, the style-enhanced module focuses on decoupling and transferring artistic style features to improve the artistry of generated virtual face images. Meanwhile, the identity-enhanced module preserves the significant features extracted from the input photo. Furthermore, the proposed method requires only a small amount of reference style data. Experiments demonstrate the superiority of StyleIdentityGAN over state-of-the-art methods in artistry and identity effects, with comparisons done qualitatively, quantitatively, and through a perceptual user study. Code has been released on GitHub.
https://arxiv.org/abs/2303.00377
In creativity support and computational co-creativity contexts, the task of discovering appropriate prompts for use with text-to-image generative models remains difficult. In many cases the creator wishes to evoke a certain impression with the image, but the task of conferring that succinctly in a text prompt poses a challenge: affective language is nuanced, complex, and model-specific. In this work we introduce a method for generating images conditioned on desired affect, quantified using a psychometrically validated three-component approach, that can be combined with conditioning on text descriptions. We first train a neural network for estimating the affect content of text and images from semantic embeddings, and then demonstrate how this can be used to exert control over a variety of generative models. We show examples of how affect modifies the outputs, provide quantitative and qualitative analysis of its capabilities, and discuss possible extensions and use cases.
https://arxiv.org/abs/2302.09742
Text-to-image synthesis refers to generating visually realistic and semantically consistent images from given textual descriptions. Previous approaches generate an initial low-resolution image and then refine it to be high-resolution. Despite the remarkable progress, these methods are limited in fully utilizing the given texts and can generate text-mismatched images, especially when the text description is complex. We propose a novel Fine-grained text-image Fusion based Generative Adversarial Network, dubbed FF-GAN, which consists of two modules: a Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR). The proposed FF-Block integrates an attention block and several convolution layers to effectively fuse the fine-grained word-context features into the corresponding visual features, in which the text information is fully used to refine the initial image with more details. The GSR is proposed to improve the global semantic consistency between linguistic and visual features during the refinement process. Extensive experiments on the CUB-200 and COCO datasets demonstrate the superiority of FF-GAN over other state-of-the-art approaches in generating images that are semantically consistent with the given texts. Code is available at this https URL.
https://arxiv.org/abs/2302.08706
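A structural sketch of a fine-grained fusion block of the kind described above: image features attend over word features, and the fused result is refined by convolutions with a residual connection. The layer sizes, head count, and residual wiring are assumptions, not the exact FF-Block.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Sketch of fine-grained text-image fusion: spatial image features query the
    word features, and the fused result is refined by convolutions. The sizes and
    the residual connection are illustrative assumptions."""
    def __init__(self, img_ch: int, word_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(img_ch, heads, kdim=word_dim, vdim=word_dim,
                                          batch_first=True)
        self.refine = nn.Sequential(nn.Conv2d(img_ch, img_ch, 3, padding=1), nn.LeakyReLU(0.2),
                                    nn.Conv2d(img_ch, img_ch, 3, padding=1))

    def forward(self, img_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        B, C, H, W = img_feats.shape
        q = img_feats.flatten(2).transpose(1, 2)              # (B, H*W, C) image queries
        fused, _ = self.attn(q, word_feats, word_feats)       # word-context features per location
        fused = fused.transpose(1, 2).reshape(B, C, H, W)
        return img_feats + self.refine(fused)                 # residual refinement

block = FusionBlock(img_ch=64, word_dim=256)
out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 18, 256))   # 18 words per caption
```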