RLHF techniques like DPO can significantly improve the generation quality of text-to-image diffusion models. However, these methods optimize for a single reward that aligns model generation with population-level preferences, neglecting the nuances of individual users' beliefs or values. This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users. Specifically, our approach (1) leverages a vision-language model (VLM) to extract personal preference embeddings from a small set of pairwise preference examples, and then (2) incorporates the embeddings into diffusion models through cross attention. Conditioning on user embeddings, the text-to-image models are fine-tuned with the DPO objective, simultaneously optimizing for alignment with the preferences of multiple users. Empirical results demonstrate that our method effectively optimizes for multiple reward functions and can interpolate between them during inference. In real-world user scenarios, with as few as four preference examples from a new user, our approach achieves an average win rate of 76\% over Stable Cascade, generating images that more accurately reflect specific user preferences.
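Editor's sketch: the core of PPD as described above is a DPO-style objective on denoising errors that is additionally conditioned on a per-user embedding extracted by a VLM. Below is a minimal PyTorch illustration of such a loss; the model signature (a network that accepts a user embedding as an extra conditioning stream) and all variable names are assumptions for illustration, not the paper's code.

import torch
import torch.nn.functional as F

def personalized_diffusion_dpo_loss(model, ref_model, x_w_t, x_l_t, noise_w, noise_l,
                                     t, text_emb, user_emb, beta=1000.0):
    """DPO-style loss on denoising errors, conditioned on a user embedding.

    x_w_t / x_l_t: noised latents of the preferred / rejected image at timestep t.
    noise_w / noise_l: the Gaussian noise that was added to each latent.
    text_emb: prompt embedding; user_emb: few-shot preference embedding from a VLM.
    """
    def err(net, x_t, noise):
        # Assumed interface: the network takes the user embedding as an extra
        # conditioning stream (e.g., appended to the cross-attention context).
        pred = net(x_t, t, text_emb, user_emb)
        return F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))

    with torch.no_grad():
        ref_w, ref_l = err(ref_model, x_w_t, noise_w), err(ref_model, x_l_t, noise_l)
    pol_w, pol_l = err(model, x_w_t, noise_w), err(model, x_l_t, noise_l)

    # The preferred sample should improve (lower error) more than the rejected one,
    # relative to the frozen reference model.
    logits = -beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -F.logsigmoid(logits).mean()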
https://arxiv.org/abs/2501.06655
The task of text-to-image generation has encountered significant challenges when applied to literary works, especially poetry. Poems are a distinct form of literature, with meanings that frequently transcend the literal words. To address this shortcoming, we propose the PoemToPixel framework, designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates the concept of prompt tuning in our image generation framework to ensure that the resulting images closely align with the poetic content. In addition, we propose the PoeKey algorithm, which extracts three key elements in the form of emotions, visual elements, and themes from poems to form instructions that are subsequently provided to a diffusion model for generating corresponding images. Furthermore, to expand the diversity of the poetry dataset across different genres and ages, we introduce MiniPo, a novel multimodal dataset comprising 1001 children's poems and images. Leveraging this dataset alongside PoemSum, we conducted both quantitative and qualitative evaluations of image generation using our PoemToPixel framework. This paper demonstrates the effectiveness of our approach and offers a fresh perspective on generating images from literary sources.
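Editor's sketch: the PoeKey step above reduces a poem to emotions, visual elements, and themes and turns them into a generation instruction. A rough illustration of that assembly follows, assuming the three elements have already been extracted; the prompt template and the Stable Diffusion checkpoint are placeholders, not the paper's implementation.

from diffusers import StableDiffusionPipeline

def poem_to_prompt(emotions, visual_elements, themes):
    # Fold the three PoeKey elements into a single instruction for the
    # image generator; the template wording is a placeholder.
    return (f"An illustration conveying {', '.join(themes)}, "
            f"depicting {', '.join(visual_elements)}, "
            f"with a mood of {', '.join(emotions)}")

# Example usage with illustrative elements extracted from a poem.
prompt = poem_to_prompt(
    emotions=["quiet wonder"],
    visual_elements=["a snowy field", "a single lantern"],
    themes=["childhood memory"],
)
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(prompt).images[0]
image.save("poem.png")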
https://arxiv.org/abs/2501.05839
Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance by training on these challenging cases. However, existing synthesis methods, which are typically unconditional -- generating images from random variables -- or conditioned only by tumor shapes, lack controllability over specific tumor characteristics such as texture, heterogeneity, boundaries, and pathology type. As a result, the generated tumors may be overly similar or duplicates of existing training data, failing to effectively address AI's weaknesses. We propose a new text-driven tumor synthesis approach, termed TextoMorph, that provides textual control over tumor characteristics. This is particularly beneficial for examples that confuse the AI the most, such as early tumor detection (increasing Sensitivity by +8.5%), tumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and classification between benign and malignant tumors (improving Sensitivity by +8.2%). By incorporating text mined from radiology reports into the synthesis process, we increase the variability and controllability of the synthetic tumors to target AI's failure cases more precisely. Moreover, TextoMorph uses contrastive learning across different texts and CT scans, significantly reducing dependence on scarce image-report pairs (only 141 pairs used in this study) by leveraging a large corpus of 34,035 radiology reports. Finally, we have developed rigorous tests to evaluate synthetic tumors, including a Text-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our synthetic tumors are realistic and diverse in texture, heterogeneity, boundaries, and pathology.
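Editor's sketch: the contrastive learning across report text and CT scans mentioned above can be pictured as a CLIP-style symmetric InfoNCE over paired embeddings. A minimal version is below; the encoders and batch construction are assumed and not taken from the paper.

import torch
import torch.nn.functional as F

def text_ct_contrastive_loss(text_emb, ct_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (report, CT) pairs are pulled together,
    all other pairings in the batch are pushed apart."""
    text_emb = F.normalize(text_emb, dim=-1)   # (B, D)
    ct_emb = F.normalize(ct_emb, dim=-1)       # (B, D)
    logits = text_emb @ ct_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))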
https://arxiv.org/abs/2412.18589
Image generation in the fashion domain has predominantly focused on preserving body characteristics or following input prompts, but little attention has been paid to improving the inherent fashionability of the output images. This paper presents a novel diffusion model-based approach that generates fashion images with improved fashionability while maintaining control over key attributes. Key components of our method include: 1) fashionability enhancement, which ensures that the generated images are more fashionable than the input; 2) preservation of body characteristics, encouraging the generated images to maintain the original shape and proportions of the input; and 3) automatic fashion optimization, which does not rely on manual input or external prompts. We also employ two methods to collect training data for guidance while generating and evaluating the images. In particular, we rate outfit images using fashionability scores annotated by multiple fashion experts through OpenSkill-based and five critical aspect-based pairwise comparisons. These methods provide complementary perspectives for assessing and improving the fashionability of the generated images. The experimental results show that our approach outperforms the baseline Fashion++ in generating images with superior fashionability, demonstrating its effectiveness in producing more stylish and appealing fashion images.
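Editor's sketch: both data-collection routes above reduce expert pairwise comparisons to per-outfit fashionability scores. As a simple stand-in for the OpenSkill-based rating the paper actually uses, here is a minimal Bradley-Terry fit that turns win/loss pairs into scores.

import numpy as np

def bradley_terry_scores(num_items, comparisons, iters=200, smoothing=0.1):
    """comparisons: list of (winner_idx, loser_idx) pairs from expert judgments.
    Returns one strength score per outfit; higher = judged more fashionable."""
    strength = np.ones(num_items)
    for _ in range(iters):
        # Light smoothing keeps items with few wins from collapsing to zero.
        wins = np.full(num_items, smoothing)
        denom = np.full(num_items, smoothing)
        for w, l in comparisons:
            inv = 1.0 / (strength[w] + strength[l])
            wins[w] += 1.0
            denom[w] += inv
            denom[l] += inv
        strength = wins / denom                    # Hunter's MM update
        strength *= num_items / strength.sum()     # fix the overall scale
    return strength

# e.g. outfit 0 beat outfit 1 twice, outfit 2 beat outfit 0 once
print(bradley_terry_scores(3, [(0, 1), (0, 1), (2, 0)]))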
https://arxiv.org/abs/2412.18421
This paper proposes a dataset augmentation method by fine-tuning pre-trained diffusion models. Generating images using a pre-trained diffusion model with textual conditioning often results in domain discrepancy between real data and generated images. We propose a fine-tuning approach where we adapt the diffusion model by conditioning it with real images and novel text embeddings. We introduce a unique procedure called Mixing Visual Concepts (MVC) where we create novel text embeddings from image captions. The MVC enables us to generate multiple images which are diverse and yet similar to the real data, enabling us to perform effective dataset augmentation. We perform comprehensive qualitative and quantitative evaluations with the proposed dataset augmentation approach, showcasing both coarse-grained and fine-grained changes in generated images. Our approach outperforms state-of-the-art augmentation techniques on benchmark classification tasks.
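Editor's sketch: one plausible reading of Mixing Visual Concepts is that novel conditioning embeddings are formed by blending the text embeddings of several real image captions before conditioning the fine-tuned diffusion model. The snippet below is a hedged guess at that spirit using a CLIP text encoder; the mixing rule and weights are illustrative assumptions, not the paper's recipe.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def mixed_caption_embedding(captions, weights):
    """Blend per-token text embeddings of several captions into one novel
    conditioning sequence."""
    tokens = tokenizer(captions, padding="max_length", truncation=True,
                       return_tensors="pt")
    emb = text_encoder(**tokens).last_hidden_state           # (N, T, D)
    w = torch.tensor(weights).view(-1, 1, 1)
    return (w * emb).sum(dim=0, keepdim=True)                 # (1, T, D)

cond = mixed_caption_embedding(
    ["a photo of a golden retriever on grass",
     "a photo of a husky in the snow"],
    weights=[0.6, 0.4],
)
print(cond.shape)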
https://arxiv.org/abs/2412.15358
Autoregressive conditional image generation algorithms are capable of generating photorealistic images that are consistent with given textual or image conditions, and have great potential for a wide range of applications. Nevertheless, the majority of popular autoregressive image generation methods rely heavily on vector quantization, and the inherent discrete characteristic of the codebook presents a considerable challenge to achieving high-quality image generation. To address this limitation, this paper introduces a novel conditional introduction network for continuous masked autoregressive models. The proposed self-control network serves to mitigate the negative impact of vector quantization on the quality of the generated images, while simultaneously enhancing the conditional control during the generation process. In particular, the self-control network is constructed upon a continuous masked autoregressive generative model, which incorporates multimodal conditional information, including text and images, into a unified autoregressive sequence in a serial manner. Through a self-attention mechanism, the network is capable of generating images that are controllable based on specific conditions. The self-control network discards the conventional cross-attention-based conditional fusion mechanism and effectively unifies the conditional and generative information within the same space, thereby facilitating more seamless learning and fusion of multimodal features.
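Editor's sketch: the key architectural point above is that conditional tokens (text, reference images) and generated image tokens are serialized into one sequence and fused by self-attention alone, with no cross-attention branch. A minimal illustration of that unified-sequence design follows; module names, feature sizes, and the training interface are assumptions.

import torch
import torch.nn as nn

class UnifiedSelfAttnGenerator(nn.Module):
    """Conditions and image tokens share one sequence; self-attention fuses them."""
    def __init__(self, dim=512, depth=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.text_proj = nn.Linear(768, dim)   # project text features into the shared space
        self.img_proj = nn.Linear(256, dim)    # project continuous image tokens likewise
        self.head = nn.Linear(dim, 256)        # predict the masked continuous tokens

    def forward(self, text_feats, cond_img_tokens, masked_img_tokens):
        seq = torch.cat([self.text_proj(text_feats),
                         self.img_proj(cond_img_tokens),
                         self.img_proj(masked_img_tokens)], dim=1)
        out = self.blocks(seq)                 # conditions and targets attend to each other
        n = masked_img_tokens.size(1)
        return self.head(out[:, -n:])          # predictions for the masked positions

model = UnifiedSelfAttnGenerator()
pred = model(torch.randn(2, 16, 768), torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(pred.shape)  # (2, 64, 256)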
https://arxiv.org/abs/2412.13635
Fine-grained text-to-image synthesis involves generating images from texts that belong to different categories. In contrast to general text-to-image synthesis, in fine-grained synthesis there is high similarity between images of different subclasses, and there may be linguistic discrepancy among texts describing the same image. Recent Generative Adversarial Networks (GANs), such as the Recurrent Affine Transformation (RAT) GAN model, are able to synthesize clear and realistic images from texts. However, GAN models ignore fine-grained level information. In this paper we propose an approach that incorporates an auxiliary classifier in the discriminator and a contrastive learning method to improve the accuracy of fine-grained details in images synthesized by RAT GAN. The auxiliary classifier helps the discriminator classify the class of images, and helps the generator synthesize more accurate fine-grained images. The contrastive learning method minimizes the similarity between images from different subclasses and maximizes the similarity between images from the same subclass. We evaluate against several state-of-the-art methods on the commonly used CUB-200-2011 bird dataset and Oxford-102 flower dataset, and demonstrate superior performance.
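Editor's sketch: the two additions above, an auxiliary class head on the discriminator and a contrastive term across subclasses, can be written as extra loss terms on top of a standard GAN objective. The feature extractor and label layout below are assumptions; the contrastive term is a supervised-contrastive style formulation consistent with the abstract's description.

import torch
import torch.nn.functional as F

def auxiliary_classifier_loss(class_logits, class_labels):
    # The discriminator's extra head must recognize the fine-grained subclass.
    return F.cross_entropy(class_logits, class_labels)

def subclass_contrastive_loss(features, class_labels, temperature=0.1):
    """Pull features of the same subclass together, push different subclasses apart."""
    f = F.normalize(features, dim=-1)
    sim = f @ f.t() / temperature                               # (B, B)
    same = class_labels.unsqueeze(0) == class_labels.unsqueeze(1)
    eye = torch.eye(len(f), dtype=torch.bool, device=f.device)
    pos = same & ~eye
    # Log-softmax over all other samples; average log-prob of the positives.
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    denom = pos.sum(dim=1).clamp(min=1)
    return -(log_prob * pos).sum(dim=1).div(denom).mean()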
https://arxiv.org/abs/2412.07196
Convolutional neural networks (CNNs) have been combined with generative adversarial networks (GANs) to create deep convolutional generative adversarial networks (DCGANs) with great success. DCGANs have been used for generating images and videos from creative domains such as fashion design and painting. A common critique of the use of DCGANs in creative applications is that they are limited in their ability to generate creative products because the generator simply learns to copy the training distribution. We explore an extension of DCGANs, creative adversarial networks (CANs). Using CANs, we generate novel, creative portraits, using the WikiArt dataset to train the network. Moreover, we introduce our extension of CANs, conditional creative adversarial networks (CCANs), and demonstrate their potential to generate creative portraits conditioned on a style label. We argue that generating products that are conditioned, or inspired, on a style label closely emulates real creative processes in which humans produce imaginative work that is still rooted in previous styles.
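Editor's sketch: the creative adversarial objective extends a DCGAN with a style classifier on the discriminator, and the generator is rewarded both for fooling the real/fake head and for producing work whose style the classifier cannot pin down. The sketch below shows that style-ambiguity term with assumed head names; the CCAN variant described above would additionally feed a style label into the generator.

import torch
import torch.nn.functional as F

def can_generator_loss(d_real_fake_logit, d_style_logits):
    """d_real_fake_logit: discriminator's realness logit for generated images.
    d_style_logits: its style-classification logits over K art styles."""
    adv = F.binary_cross_entropy_with_logits(
        d_real_fake_logit, torch.ones_like(d_real_fake_logit))
    # Style ambiguity: push the predicted style distribution toward uniform,
    # i.e. maximize the entropy of the classifier's prediction.
    k = d_style_logits.size(-1)
    uniform = torch.full_like(d_style_logits, 1.0 / k)
    ambiguity = F.cross_entropy(d_style_logits, uniform)
    return adv + ambiguity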
https://arxiv.org/abs/2412.07091
How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns audio-visual modalities by enriching audio features with visual information and translating them into the visual latent space. These features are then fed into the pre-trained image generator to produce images. To enhance image quality, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. Our method achieves substantially better results on the VEGAS and VGGSound datasets compared to previous work and demonstrates control over the generation process through simple manipulations to the input waveform or latent space. Furthermore, we analyze the geometric properties of the learned embedding space and demonstrate that our learning approach effectively aligns audio-visual signals for cross-modal generation. Based on this analysis, we show that our method is agnostic to specific design choices, showing its generalizability by integrating various model architectures and different types of audio-visual data.
https://arxiv.org/abs/2412.06209
Accurately generating images of human bodies from text remains a challenging problem for state of the art text-to-image models. Commonly observed body-related artifacts include extra or missing limbs, unrealistic poses, blurred body parts, etc. Currently, evaluation of such artifacts relies heavily on time-consuming human judgments, limiting the ability to benchmark models at scale. We address this by proposing BodyMetric, a learnable metric that predicts body realism in images. BodyMetric is trained on realism labels and multi-modal signals including 3D body representations inferred from the input image, and textual descriptions. In order to facilitate this approach, we design an annotation pipeline to collect expert ratings on human body realism leading to a new dataset for this task, namely, BodyRealism. Ablation studies support our architectural choices for BodyMetric and the importance of leveraging a 3D human body prior in capturing body-related artifacts in 2D images. In comparison to concurrent metrics which evaluate general user preference in images, BodyMetric specifically reflects body-related artifacts. We demonstrate the utility of BodyMetric through applications that were previously infeasible at scale. In particular, we use BodyMetric to benchmark the generation ability of text-to-image models to produce realistic human bodies. We also demonstrate the effectiveness of BodyMetric in ranking generated images based on the predicted realism scores.
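Editor's sketch: BodyMetric, as described, fuses image features, an inferred 3D body representation, and text features into a learnable realism score trained on expert labels. Below is a minimal fusion regressor in that shape; the encoders, feature sizes, and loss choice are assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class BodyRealismRegressor(nn.Module):
    def __init__(self, img_dim=768, body_dim=256, text_dim=768, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + body_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, img_feat, body_feat, text_feat):
        # Concatenate the three modalities and predict a scalar realism score.
        return self.mlp(torch.cat([img_feat, body_feat, text_feat], dim=-1)).squeeze(-1)

model = BodyRealismRegressor()
score = model(torch.randn(4, 768), torch.randn(4, 256), torch.randn(4, 768))
loss = nn.functional.mse_loss(score, torch.rand(4))   # regress expert realism labels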
https://arxiv.org/abs/2412.04086
Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, the inference speed of diffusion models is limited by the slow sampling process, restricting their use cases. The sequential denoising steps required for generating a single sample could take tens or hundreds of iterations and thus have become a significant bottleneck. This limitation is more salient for applications that are interactive in nature or require small latency. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Using the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute patches of an image across multiple computing devices based on the entire image (all patches) from the previous diffusion step. PCPP develops PP to reduce computation in inference by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices. As a result, PCPP decreases the communication cost by around $70\%$ compared to DistriFusion (the state-of-the-art implementation of PP) and achieves a $2.36\sim 8.02\times$ inference speed-up using $4\sim 8$ GPUs, compared to the $2.32\sim 6.71\times$ achieved by DistriFusion, depending on the computing device configuration and generation resolution, at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.
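Editor's sketch: the PCPP idea above is that, at each step, a device denoises its own patch while reading only a thin band of its neighbors' stale activations rather than the full image. The single-process illustration below shows that partial-halo conditioning; the halo width, patch layout, and denoiser interface are assumptions, and real PCPP overlaps this with asynchronous multi-GPU communication.

import torch

def pcpp_step(denoiser, patches, prev_patches, t, halo=8):
    """patches: current latent patches split along the width dimension.
    prev_patches: the same patches from the previous diffusion step (stale copies
    that would normally live on the neighboring devices)."""
    new_patches = []
    for i, patch in enumerate(patches):
        # Condition only on a narrow halo from each neighbor's previous-step patch
        # instead of the whole image, which is what cuts the communication volume.
        left = prev_patches[i - 1][..., -halo:] if i > 0 else None
        right = prev_patches[i + 1][..., :halo] if i < len(patches) - 1 else None
        context = torch.cat([p for p in (left, patch, right) if p is not None], dim=-1)
        denoised = denoiser(context, t)
        # Keep only the center region that corresponds to this device's patch.
        start = halo if left is not None else 0
        new_patches.append(denoised[..., start:start + patch.shape[-1]])
    return new_patches

# Toy usage: split a (1, 4, 64, 64) latent into 4 width-wise patches, identity "denoiser".
patches = list(torch.randn(1, 4, 64, 64).chunk(4, dim=-1))
out = pcpp_step(lambda x, t: x, patches, [p.clone() for p in patches], t=10)
print([p.shape for p in out])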
https://arxiv.org/abs/2412.02962
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences -- a more challenging task than fixed-order generation -- RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained from random orders acquire new capabilities. To address the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying a 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at this https URL.
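Editor's sketch: the mechanism described above can be made concrete by building, for a random permutation of spatial positions, a training sequence that interleaves a position-instruction token (encoding where the next image token lives) with the image token itself. The token-id conventions below are assumptions for illustration.

import torch

def build_randar_sequence(image_tokens, vocab_size):
    """image_tokens: (H*W,) discrete codes in raster order.
    Returns an interleaved sequence [pos_0, tok_0, pos_1, tok_1, ...] under a
    random spatial order; position-instruction ids are offset past the codebook."""
    n = image_tokens.numel()
    order = torch.randperm(n)                        # random generation order
    pos_instruction = vocab_size + order             # one special id per spatial slot
    seq = torch.stack([pos_instruction, image_tokens[order]], dim=1).flatten()
    return seq, order

tokens = torch.randint(0, 1024, (256,))              # e.g. a 16x16 token map
seq, order = build_randar_sequence(tokens, vocab_size=1024)
print(seq[:6])   # pos, token, pos, token, ...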
https://arxiv.org/abs/2412.01827
This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike the conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation to filter out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
https://arxiv.org/abs/2412.01254
We introduce Orthus, an autoregressive (AR) transformer that excels in generating images given textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved contents. Unlike prior art on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation, while the fully AR formulation renders the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads -- one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features conditioned on the output of the backbone. We devise an efficient strategy for building Orthus -- by substituting the Vector Quantization (VQ) operation in the existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within a mere 72 A100 GPU hours). Orthus-base can further embrace post-training to better model interleaved images and texts. Empirically, Orthus surpasses competing baselines including Show-o and Chameleon across standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters. Orthus also shows exceptional mixed-modality generation capabilities, reflecting the potential for handling intricate practical generation tasks.
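Editor's sketch: the modality-specific heads can be pictured as two small modules reading the same backbone state, a linear LM head for discrete text tokens and a noise-prediction MLP ("diffusion head") for continuous image features. Dimensions and the diffusion head's interface below are assumptions, not Orthus's actual code.

import torch
import torch.nn as nn

class ModalityHeads(nn.Module):
    def __init__(self, hidden=1024, vocab_size=32000, img_feat_dim=256):
        super().__init__()
        self.lm_head = nn.Linear(hidden, vocab_size)          # discrete text tokens
        self.diffusion_head = nn.Sequential(                  # continuous image features
            nn.Linear(hidden + img_feat_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, img_feat_dim),
        )

    def text_logits(self, backbone_state):
        return self.lm_head(backbone_state)

    def predict_image_noise(self, backbone_state, noisy_feat, t):
        # The diffusion head denoises continuous image features conditioned on
        # the backbone's output at that position and the diffusion timestep.
        t = t.float().unsqueeze(-1)
        return self.diffusion_head(torch.cat([backbone_state, noisy_feat, t], dim=-1))

heads = ModalityHeads()
state = torch.randn(2, 1024)
print(heads.text_logits(state).shape)                                  # (2, 32000)
print(heads.predict_image_noise(state, torch.randn(2, 256), torch.tensor([10, 500])).shape)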
https://arxiv.org/abs/2412.00127
Visual communication, dating back to prehistoric cave paintings, is the use of visual elements to convey ideas and information. In today's visually saturated world, effective design demands an understanding of graphic design principles, visual storytelling, human psychology, and the ability to distill complex information into clear visuals. This dissertation explores how recent advancements in vision-language models (VLMs) can be leveraged to automate the creation of effective visual communication designs. Although generative models have made great progress in generating images from text, they still struggle to simplify complex ideas into clear, abstract visuals and are constrained by pixel-based outputs, which lack flexibility for many design tasks. To address these challenges, we constrain the models' operational space and introduce task-specific regularizations. We explore various aspects of visual communication, namely, sketches and visual abstraction, typography, animation, and visual inspiration.
https://arxiv.org/abs/2411.18727
Personalized image generation has been significantly advanced, enabling the creation of highly realistic and customized images. However, existing methods often struggle with generating images of multiple people due to occlusions and fail to accurately personalize full-body shapes. In this paper, we propose PersonaCraft, a novel approach that combines diffusion models with 3D human modeling to address these limitations. Our method effectively manages occlusions by incorporating 3D-aware pose conditioning with SMPLx-ControlNet and accurately personalizes human full-body shapes through SMPLx fitting. Additionally, PersonaCraft enables user-defined body shape adjustments, adding flexibility for individual body customization. Experimental results demonstrate the superior performance of PersonaCraft in generating high-quality, realistic images of multiple individuals while resolving occlusion issues, thus establishing a new standard for multi-person personalized image synthesis. Project page: this https URL
https://arxiv.org/abs/2411.18068
Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. In this paper, to overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling particularly in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to $4096 \times 4096$ pixels and $4\times$ super-resolution. Additionally, multi-scale features extracted from ZoomLDM are highly effective in multiple instance learning experiments. We provide high-resolution examples of the generated images on our website this https URL.
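Editor's sketch: the magnification-aware conditioning described above pairs an SSL embedding of the patch content with an explicit encoding of the "zoom" level. A minimal way to assemble that conditioning vector is shown below; the embedding sizes and the way the two parts are combined are assumptions.

import torch
import torch.nn as nn

class MagnificationConditioner(nn.Module):
    def __init__(self, ssl_dim=1024, num_zoom_levels=5, cond_dim=512):
        super().__init__()
        self.zoom_embed = nn.Embedding(num_zoom_levels, cond_dim)   # one entry per scale
        self.ssl_proj = nn.Linear(ssl_dim, cond_dim)

    def forward(self, ssl_embedding, zoom_level):
        # The diffusion model receives one vector that says both *what* the patch
        # contains (SSL features) and *at which scale* it should be rendered.
        return self.ssl_proj(ssl_embedding) + self.zoom_embed(zoom_level)

cond = MagnificationConditioner()(torch.randn(2, 1024), torch.tensor([0, 3]))
print(cond.shape)  # (2, 512)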
https://arxiv.org/abs/2411.16969
While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.
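Editor's sketch: SGScore, as described, asks a multimodal LLM whether each object and each relationship in the scene graph is actually present in the generated image and aggregates the answers. The loop below is schematic; ask_vlm is a hypothetical stand-in for whatever chain-of-thought VLM query the benchmark uses.

def sg_score(image, scene_graph, ask_vlm):
    """scene_graph: {"objects": [...], "relations": [(subj, predicate, obj), ...]}.
    ask_vlm(image, question) -> bool is a placeholder for the multimodal LLM call."""
    checks = []
    for obj in scene_graph["objects"]:
        checks.append(ask_vlm(image, f"Is there a {obj} in this image?"))
    for subj, pred, obj in scene_graph["relations"]:
        checks.append(ask_vlm(image, f"Is the {subj} {pred} the {obj} in this image?"))
    return sum(checks) / max(len(checks), 1)   # fraction of graph facts satisfied

# Example: a toy graph scored with a dummy oracle.
graph = {"objects": ["dog", "frisbee"], "relations": [("dog", "catching", "frisbee")]}
print(sg_score(None, graph, ask_vlm=lambda img, q: True))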
https://arxiv.org/abs/2411.15435
Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous for generating high-quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as the requirements in terms of both memory and compute grow quadratically. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has a linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames in a sequential fashion, minimizing memory and compute requirements, while still being able to train in parallel. We show the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM replacing MHA, and we obtain high-quality samples while using fewer computational resources. The code is available at this https URL.
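Editor's sketch: the paper's exact Polynomial Mixer parameterization is not reproduced here; the toy layer below only illustrates the general recipe the abstract describes, namely compressing the whole sequence into one explicit state at linear cost and letting every token read from that state. All module names and the second-order (multiplicative) feature choice are assumptions.

import torch
import torch.nn as nn

class LinearStateMixer(nn.Module):
    """Toy stand-in for a PoM-style layer: an explicit pooled state replaces
    token-to-token attention, so mixing cost grows linearly with sequence length."""
    def __init__(self, dim=256, state_dim=512):
        super().__init__()
        self.to_feat = nn.Linear(dim, state_dim)
        self.read = nn.Linear(state_dim, dim)
        self.gate = nn.Linear(dim, state_dim)

    def forward(self, x):                       # x: (B, N, dim)
        feats = self.to_feat(x)
        # Second-order interaction folded into a single summed (explicit) state.
        state = (feats * torch.sigmoid(self.gate(x))).mean(dim=1, keepdim=True)
        return x + self.read(state).expand(-1, x.size(1), -1)

y = LinearStateMixer()(torch.randn(2, 4096, 256))
print(y.shape)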
https://arxiv.org/abs/2411.12663
Blind image restoration remains a significant challenge in low-level vision tasks. Recently, denoising diffusion models have shown remarkable performance in image synthesis. Guided diffusion models, leveraging the potent generative priors of pre-trained models along with a differential guidance loss, have achieved promising results in blind image restoration. However, these models typically consider data consistency solely in the spatial domain, often resulting in distorted image content. In this paper, we propose a novel frequency-aware guidance loss that can be integrated into various diffusion models in a plug-and-play manner. Our proposed guidance loss, based on 2D discrete wavelet transform, simultaneously enforces content consistency in both the spatial and frequency domains. Experimental results demonstrate the effectiveness of our method in three blind restoration tasks: blind image deblurring, imaging through turbulence, and blind restoration for multiple degradations. Notably, our method achieves a significant improvement in PSNR score, with a remarkable enhancement of 3.72\,dB in image deblurring. Moreover, our method exhibits superior capability in generating images with rich details and reduced distortion, leading to the best visual quality.
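Editor's sketch: the proposed guidance combines a spatial-domain consistency term with the same comparison in the wavelet domain. Below is a minimal differentiable version using a one-level Haar transform; the L1 distance, the single decomposition level, and the weighting are assumptions, and how the term is plugged into the diffusion guidance step is omitted.

import torch
import torch.nn.functional as F

def haar_dwt2(x):
    """One-level 2D Haar transform of (B, C, H, W) with even H, W."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def frequency_aware_loss(estimate, reference, freq_weight=1.0):
    # Enforce consistency in the spatial domain and across all wavelet subbands.
    spatial = F.l1_loss(estimate, reference)
    freq = sum(F.l1_loss(e, r) for e, r in zip(haar_dwt2(estimate), haar_dwt2(reference)))
    return spatial + freq_weight * freq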
https://arxiv.org/abs/2411.12450