Given a style-reference image as an additional image condition, text-to-image diffusion models have demonstrated impressive capabilities in generating images that possess the content of text prompts while adopting the visual style of the reference image. However, current state-of-the-art methods often struggle to disentangle content and style from style-reference images, leading to issues such as content leakage. To address this issue, we propose a masking-based method that efficiently decouples content from style without tuning any model parameters. By simply masking specific elements in the style reference's image features, we uncover a critical yet under-explored principle: guiding with fewer, appropriately selected conditions (e.g., dropping several image feature elements) can efficiently prevent unwanted content from flowing into the diffusion model, enhancing the style transfer performance of text-to-image diffusion models. In this paper, we validate this finding both theoretically and experimentally. Extensive experiments across various styles demonstrate the effectiveness of our masking-based method and support our theoretical results.
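As a concrete illustration of the masking idea, the sketch below drops a fraction of the style reference's image-feature tokens before they are passed to the diffusion model as conditions. It is a minimal sketch rather than the paper's exact procedure: the selection rule (dropping the tokens most aligned with an unwanted content embedding), the drop ratio, and names such as mask_style_features are illustrative assumptions.

```python
import torch

def mask_style_features(style_tokens: torch.Tensor,
                        content_embedding: torch.Tensor,
                        drop_ratio: float = 0.25) -> torch.Tensor:
    """Zero out the style-reference feature tokens most aligned with an unwanted
    content direction, so less content flows into the diffusion model.
    style_tokens: (N, D) image-feature tokens of the style reference;
    content_embedding: (D,) embedding of the content to suppress."""
    sims = torch.nn.functional.cosine_similarity(
        style_tokens, content_embedding.unsqueeze(0), dim=-1)
    k = int(drop_ratio * style_tokens.shape[0])
    drop_idx = sims.topk(k).indices                # the most content-like tokens
    mask = torch.ones(style_tokens.shape[0], 1)
    mask[drop_idx] = 0.0                           # guide with fewer conditions
    return style_tokens * mask

# Toy usage: 16 style tokens of dimension 8 and a random "content" direction.
masked = mask_style_features(torch.randn(16, 8), torch.randn(8))
print((masked.abs().sum(dim=-1) == 0).sum().item())  # 4 tokens dropped
```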
https://arxiv.org/abs/2502.07466
We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a significant step toward generating images through a novel, scalable, coarse-to-fine "next-scale prediction" framework. These models set a new quality bar, outperforming previous methods, including Diffusion Transformers, and achieving state-of-the-art performance on image synthesis tasks. Our primary contribution establishes that single-head VAR transformers with a single self-attention layer and a single interpolation layer are universal. From a statistical perspective, we prove that such simple VAR transformers are universal approximators for any image-to-image Lipschitz function. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR transformer strategies that can extend their utility to more sophisticated VAR models in image generation and other related areas.
https://arxiv.org/abs/2502.06167
Text-to-image generative models have shown remarkable progress in producing diverse and photorealistic outputs. In this paper, we present a comprehensive analysis of their effectiveness in creating synthetic portraits that accurately represent various demographic attributes, with a special focus on age, nationality, and gender. Our evaluation employs prompts specifying detailed profiles (e.g., "Photorealistic selfie photo of a 32-year-old Canadian male"), covering a broad spectrum of 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. We compare the generated images against age estimates from two established age-estimation models, used as ground truth, to assess how faithfully age is depicted. Our findings reveal that although text-to-image models can consistently generate faces reflecting different identities, the accuracy with which they capture specific ages across diverse demographic backgrounds remains highly variable. These results suggest that current synthetic data may be insufficiently reliable for high-stakes age-related tasks requiring robust precision, unless practitioners are prepared to invest in significant filtering and curation. Nevertheless, such data may still be useful in less sensitive or exploratory applications where absolute age precision is not critical.
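The evaluation protocol can be pictured as a simple loop over demographic profiles. The sketch below is a scaled-down illustration in which generate_image and estimate_age are hypothetical stand-ins for a text-to-image model and an age-estimation model, and the short attribute lists are placeholders for the full 212 nationalities and 30 ages used in the study.

```python
import itertools, statistics

NATIONALITIES = ["Canadian", "Nigerian", "Japanese"]   # the study covers 212
AGES = [10, 25, 32, 54, 78]                            # the study uses 30 ages
GENDERS = ["male", "female"]

def build_prompt(age: int, nationality: str, gender: str) -> str:
    return f"Photorealistic selfie photo of a {age}-year-old {nationality} {gender}"

def evaluate(generate_image, estimate_age):
    """Mean absolute age error per (nationality, gender) subgroup.
    generate_image(prompt) -> image and estimate_age(image) -> float are
    hypothetical stand-ins for a text-to-image model and an age estimator."""
    errors = {}
    for nat, age, gender in itertools.product(NATIONALITIES, AGES, GENDERS):
        image = generate_image(build_prompt(age, nat, gender))
        errors.setdefault((nat, gender), []).append(abs(estimate_age(image) - age))
    return {group: statistics.mean(errs) for group, errs in errors.items()}

# Toy run with dummy stand-ins (a real run would plug in actual models).
mae = evaluate(generate_image=lambda prompt: prompt,
               estimate_age=lambda image: 40.0)
print(mae[("Canadian", "male")])   # 21.0
```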
https://arxiv.org/abs/2502.03420
Existing diffusion models show great potential for identity-preserving generation. However, personalized portrait generation remains challenging due to the diversity of user profiles, including variations in appearance and lighting conditions. To address these challenges, we propose IC-Portrait, a novel framework designed to accurately encode individual identities for personalized portrait generation. Our key insight is that pre-trained diffusion models are fast learners (e.g., 100-200 steps) of in-context dense correspondence matching, which motivates the two major designs of our IC-Portrait framework. Specifically, we reformulate portrait generation into two sub-tasks: 1) Lighting-Aware Stitching: we find that masking a high proportion of the input image, e.g., 80%, yields highly effective self-supervised representation learning of the reference image's lighting. 2) View-Consistent Adaptation: we leverage a synthetic view-consistent profile dataset to learn the in-context correspondence. The reference profile can then be warped into arbitrary poses for strong spatially aligned view conditioning. Coupling these two designs by simply concatenating latents to form ControlNet-like supervision and modeling enables us to significantly enhance identity-preservation fidelity and stability. Extensive evaluations demonstrate that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively, with particularly notable improvements in visual quality. Furthermore, IC-Portrait even demonstrates 3D-aware relighting capabilities.
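A minimal sketch of the high-ratio masking behind Lighting-Aware Stitching is shown below: hide roughly 80% of the input image's patches and train the model to work from the remainder, which encourages it to rely on global cues such as lighting. The patch size, exact ratio handling, and the name random_patch_mask are illustrative assumptions, not IC-Portrait's implementation.

```python
import torch

def random_patch_mask(image: torch.Tensor, patch: int = 16,
                      mask_ratio: float = 0.8) -> torch.Tensor:
    """Return a binary mask (1 = keep, 0 = hide) hiding `mask_ratio` of the
    image's non-overlapping patches. image: (C, H, W), H and W divisible by patch."""
    _, h, w = image.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked = int(mask_ratio * n_patches)
    order = torch.randperm(n_patches)
    keep = torch.ones(n_patches)
    keep[order[:n_masked]] = 0.0
    mask = keep.view(gh, gw).repeat_interleave(patch, 0).repeat_interleave(patch, 1)
    return mask.unsqueeze(0)   # (1, H, W), broadcastable over channels

# Toy example: the model would be trained to reconstruct the hidden patches,
# which pushes it toward global cues such as lighting.
img = torch.rand(3, 64, 64)
mask = random_patch_mask(img, patch=16, mask_ratio=0.8)
masked_input = img * mask
print(f"visible fraction: {mask.mean().item():.2f}")
```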
https://arxiv.org/abs/2501.17159
Scene and object reconstruction is an important problem in robotics, in particular for planning collision-free trajectories or for object manipulation. This paper compares two strategies for reconstructing the non-visible parts of an object's surface from a single RGB-D camera view. The first method, named DeepSDF, predicts the Signed Distance Transform to the object surface for a given point in 3D space. The second method, named MirrorNet, reconstructs the occluded parts of objects by generating images from the other side of the observed object. Experiments performed with objects from the ShapeNet dataset show that the view-dependent MirrorNet is faster and has smaller reconstruction errors in most categories.
https://arxiv.org/abs/2501.16101
We propose a zero-shot method for generating images in arbitrary spaces (e.g., a sphere for 360° panoramas or a mesh surface for texture) using a pretrained image diffusion model. Zero-shot generation of diverse visual content with a pretrained image diffusion model has been explored mainly in two directions. First, Diffusion Synchronization, which performs reverse diffusion processes jointly across different projected spaces while synchronizing them in the target space, generates high-quality outputs when enough conditioning is provided but struggles in its absence. Second, Score Distillation Sampling, which gradually updates the target-space data through gradient descent, yields better coherence but often lacks detail. In this paper, we reveal for the first time the interconnection between these two methods while highlighting their differences. To this end, we propose StochSync, a novel approach that combines the strengths of both, enabling effective performance with weak conditioning. Our experiments demonstrate that StochSync provides the best performance in 360° panorama generation (where image conditioning is not given), outperforming previous finetuning-based methods, and also delivers results comparable to previous methods in 3D mesh texturing (where depth conditioning is provided).
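For context, the sketch below illustrates the diffusion-synchronization idea that StochSync builds on: per-view denoised predictions are unprojected into the shared target space, averaged there, and mapped back to each view. The projection operators and the toy identity-projection usage are simplified placeholders, not StochSync itself.

```python
import torch

def synchronize(view_predictions, unproject_fns, project_fns, target_shape):
    """Average per-view denoised predictions in the shared target space, then
    map the synchronized result back to every view. Each unproject_fns[i] maps
    view i into the target space and returns (values, coverage weights); each
    project_fns[i] maps the target space back to view i. Both are placeholders
    for, e.g., panorama or mesh-texture projections."""
    target = torch.zeros(target_shape)
    weight = torch.zeros(target_shape)
    for pred, unproject in zip(view_predictions, unproject_fns):
        values, w = unproject(pred)
        target += values
        weight += w
    target = target / weight.clamp(min=1e-8)   # weighted average in target space
    return [project(target) for project in project_fns]

# Toy usage: two "views" that each cover the whole target (identity projections).
views = [torch.randn(4, 4), torch.randn(4, 4)]
unproj = [lambda x: (x, torch.ones_like(x))] * 2
proj = [lambda t: t] * 2
synced = synchronize(views, unproj, proj, target_shape=(4, 4))
print(torch.allclose(synced[0], (views[0] + views[1]) / 2))   # True
```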
https://arxiv.org/abs/2501.15445
RLHF techniques like DPO can significantly improve the generation quality of text-to-image diffusion models. However, these methods optimize for a single reward that aligns model generation with population-level preferences, neglecting the nuances of individual users' beliefs or values. This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users. Specifically, our approach (1) leverages a vision-language model (VLM) to extract personal preference embeddings from a small set of pairwise preference examples, and then (2) incorporates the embeddings into diffusion models through cross-attention. Conditioned on user embeddings, the text-to-image model is fine-tuned with the DPO objective, simultaneously optimizing for alignment with the preferences of multiple users. Empirical results demonstrate that our method effectively optimizes for multiple reward functions and can interpolate between them during inference. In real-world user scenarios, with as few as four preference examples from a new user, our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
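A minimal sketch of step (2), injecting a user-preference embedding into image features via cross-attention, is given below. It uses a generic torch attention layer with illustrative dimensions; the VLM-based embedding extraction and the DPO fine-tuning objective are outside the scope of this snippet.

```python
import torch
import torch.nn as nn

class PreferenceCrossAttention(nn.Module):
    """Inject a per-user preference embedding into image features via
    cross-attention (queries = image tokens, keys/values = user embedding)."""
    def __init__(self, dim: int, pref_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=pref_dim,
                                          vdim=pref_dim, batch_first=True)

    def forward(self, image_tokens: torch.Tensor, pref: torch.Tensor):
        # pref: (B, pref_dim) -> a length-1 "sequence" of preference tokens
        pref_seq = pref.unsqueeze(1)
        attended, _ = self.attn(image_tokens, pref_seq, pref_seq)
        return image_tokens + attended   # residual conditioning

# Toy usage: 2 users, 16 image tokens of dim 32, preference embeddings of dim 8.
layer = PreferenceCrossAttention(dim=32, pref_dim=8)
out = layer(torch.randn(2, 16, 32), torch.randn(2, 8))
print(out.shape)   # torch.Size([2, 16, 32])
```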
https://arxiv.org/abs/2501.06655
The task of text-to-image generation has encountered significant challenges when applied to literary works, especially poetry. Poems are a distinct form of literature, with meanings that frequently transcend the literal words. To address this shortcoming, we propose PoemToPixel, a framework designed to generate images that visually represent the inherent meanings of poems. Our approach incorporates the concept of prompt tuning into our image generation framework to ensure that the resulting images closely align with the poetic content. In addition, we propose the PoeKey algorithm, which extracts three key elements from poems (emotions, visual elements, and themes) and forms them into instructions that are subsequently provided to a diffusion model for generating corresponding images. Furthermore, to expand the diversity of the poetry dataset across different genres and ages, we introduce MiniPo, a novel multimodal dataset comprising 1001 children's poems and images. Leveraging this dataset alongside PoemSum, we conducted both quantitative and qualitative evaluations of image generation using our PoemToPixel framework. This paper demonstrates the effectiveness of our approach and offers a fresh perspective on generating images from literary sources.
https://arxiv.org/abs/2501.05839
Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance through training on these challenging cases. However, existing synthesis methods, which are typically unconditional (generating images from random variables) or conditioned only on tumor shapes, lack controllability over specific tumor characteristics such as texture, heterogeneity, boundaries, and pathology type. As a result, the generated tumors may be overly similar to, or duplicates of, existing training data, failing to effectively address AI's weaknesses. We propose a new text-driven tumor synthesis approach, termed TextoMorph, that provides textual control over tumor characteristics. This is particularly beneficial for the examples that confuse the AI the most, such as early tumor detection (increasing Sensitivity by +8.5%), tumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and classification between benign and malignant tumors (improving Sensitivity by +8.2%). By incorporating text mined from radiology reports into the synthesis process, we increase the variability and controllability of the synthetic tumors to target AI's failure cases more precisely. Moreover, TextoMorph uses contrastive learning across different texts and CT scans, significantly reducing dependence on scarce image-report pairs (only 141 pairs are used in this study) by leveraging a large corpus of 34,035 radiology reports. Finally, we have developed rigorous tests to evaluate synthetic tumors, including a Text-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our synthetic tumors are realistic and diverse in texture, heterogeneity, boundaries, and pathology.
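The contrastive component can be sketched as a symmetric InfoNCE-style loss between report-text embeddings and CT-scan embeddings, as below. The encoders are replaced by random tensors, and the exact loss form and temperature are assumptions; the paper's formulation may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, scan_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pairing each report embedding with its CT-scan
    embedding. text_emb, scan_emb: (B, D); row i of each is a matched pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    scan_emb = F.normalize(scan_emb, dim=-1)
    logits = text_emb @ scan_emb.t() / temperature
    targets = torch.arange(text_emb.shape[0])
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())
```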
https://arxiv.org/abs/2412.18589
Image generation in the fashion domain has predominantly focused on preserving body characteristics or following input prompts, but little attention has been paid to improving the inherent fashionability of the output images. This paper presents a novel diffusion model-based approach that generates fashion images with improved fashionability while maintaining control over key attributes. Key components of our method include: 1) fashionability enhancement, which ensures that the generated images are more fashionable than the input; 2) preservation of body characteristics, encouraging the generated images to maintain the original shape and proportions of the input; and 3) automatic fashion optimization, which does not rely on manual input or external prompts. We also employ two methods to collect training data that guide the generation and evaluation of images. In particular, we rate outfit images using fashionability scores annotated by multiple fashion experts, obtained through OpenSkill-based pairwise comparisons and pairwise comparisons along five critical aspects. These methods provide complementary perspectives for assessing and improving the fashionability of the generated images. The experimental results show that our approach outperforms the baseline Fashion++ in generating images with superior fashionability, demonstrating its effectiveness in producing more stylish and appealing fashion images.
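To illustrate how pairwise expert judgments can be turned into scalar fashionability scores, the sketch below uses a simple Elo-style update as a library-agnostic stand-in for the OpenSkill-based rating the paper employs; the K-factor and initial rating are arbitrary illustrative choices.

```python
from collections import defaultdict

def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo-style update after an expert judges outfit A vs. outfit B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

def rate_outfits(comparisons):
    """comparisons: iterable of (winner_id, loser_id) expert judgments."""
    ratings = defaultdict(lambda: 1000.0)
    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = elo_update(ratings[winner],
                                                     ratings[loser], True)
    return dict(ratings)

# Toy usage: outfit "a" beats "b" twice, "b" beats "c" once.
print(rate_outfits([("a", "b"), ("a", "b"), ("b", "c")]))
```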
https://arxiv.org/abs/2412.18421
This paper proposes a dataset augmentation method based on fine-tuning pre-trained diffusion models. Generating images using a pre-trained diffusion model with textual conditioning often results in a domain discrepancy between real data and generated images. We propose a fine-tuning approach in which we adapt the diffusion model by conditioning it on real images and novel text embeddings. We introduce a unique procedure called Mixing Visual Concepts (MVC), in which we create novel text embeddings from image captions. MVC enables us to generate multiple images that are diverse yet similar to the real data, enabling effective dataset augmentation. We perform comprehensive qualitative and quantitative evaluations of the proposed dataset augmentation approach, showcasing both coarse-grained and fine-grained changes in the generated images. Our approach outperforms state-of-the-art augmentation techniques on benchmark classification tasks.
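One plausible way to picture MVC is as convex mixing of caption embeddings to obtain novel text embeddings, as in the sketch below; the pair-sampling scheme, mixing weights, and the name mix_caption_embeddings are assumptions for illustration rather than the paper's exact procedure.

```python
import torch

def mix_caption_embeddings(caption_embs: torch.Tensor, n_new: int) -> torch.Tensor:
    """Create n_new novel text embeddings by convexly mixing random pairs of
    caption embeddings. caption_embs: (N, D) embeddings of real image captions."""
    n = caption_embs.shape[0]
    i = torch.randint(0, n, (n_new,))
    j = torch.randint(0, n, (n_new,))
    lam = torch.rand(n_new, 1) * 0.5 + 0.25        # mixing weights in [0.25, 0.75]
    return lam * caption_embs[i] + (1.0 - lam) * caption_embs[j]

# Toy usage: 10 caption embeddings of dim 16 -> 4 novel mixed embeddings, which
# would then condition the fine-tuned diffusion model to generate varied images.
new_embs = mix_caption_embeddings(torch.randn(10, 16), n_new=4)
print(new_embs.shape)   # torch.Size([4, 16])
```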
https://arxiv.org/abs/2412.15358
Autoregressive conditional image generation algorithms are capable of generating photorealistic images that are consistent with given textual or image conditions, and they have great potential for a wide range of applications. Nevertheless, most popular autoregressive image generation methods rely heavily on vector quantization, and the inherently discrete nature of the codebook presents a considerable challenge to achieving high-quality image generation. To address this limitation, this paper introduces a novel conditional introduction network for continuous masked autoregressive models. The proposed self-control network mitigates the negative impact of vector quantization on the quality of the generated images while simultaneously enhancing conditional control during the generation process. In particular, the self-control network is built upon a continuous masked autoregressive generative model, which incorporates multimodal conditional information, including text and images, into a unified autoregressive sequence in a serial manner. Through a self-attention mechanism, the network can generate images that are controllable based on specific conditions. The self-control network discards the conventional cross-attention-based conditional fusion mechanism and effectively unifies the conditional and generative information within the same space, thereby facilitating more seamless learning and fusion of multimodal features.
https://arxiv.org/abs/2412.13635
Fine-grained text-to-image synthesis involves generating images from texts that belong to different categories. In contrast to general text-to-image synthesis, in fine-grained synthesis there is high similarity between images of different subclasses, and there may be linguistic discrepancies among texts describing the same image. Recent Generative Adversarial Networks (GANs), such as the Recurrent Affine Transformation (RAT) GAN model, are able to synthesize clear and realistic images from text. However, GAN models ignore fine-grained level information. In this paper we propose an approach that incorporates an auxiliary classifier in the discriminator and a contrastive learning method to improve the accuracy of fine-grained details in images synthesized by RAT GAN. The auxiliary classifier helps the discriminator classify the class of images and helps the generator synthesize more accurate fine-grained images. The contrastive learning method minimizes the similarity between images from different subclasses and maximizes the similarity between images from the same subclass. We evaluate against several state-of-the-art methods on the commonly used CUB-200-2011 bird dataset and the Oxford-102 flower dataset, and demonstrate superior performance.
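The sketch below shows the two added ingredients in isolation: a discriminator with an auxiliary classification head, and a supervised contrastive term that pulls same-subclass features together while pushing different subclasses apart. Network sizes, the loss weighting, and the backbone are illustrative placeholders, not the RAT GAN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxDiscriminator(nn.Module):
    """Discriminator with a real/fake head plus an auxiliary fine-grained
    class head (illustrative backbone and sizes)."""
    def __init__(self, feat_dim: int = 128, n_classes: int = 200):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.adv_head = nn.Linear(feat_dim, 1)          # real vs. fake
        self.cls_head = nn.Linear(feat_dim, n_classes)  # auxiliary classifier

    def forward(self, x):
        features = self.backbone(x)
        return self.adv_head(features), self.cls_head(features), features

def subclass_contrastive(features, labels, temperature=0.1):
    """Pull same-subclass features together and push different subclasses apart."""
    f = F.normalize(features, dim=-1)
    sim = f @ f.t() / temperature
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, -1e9)              # ignore self-similarity
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float().masked_fill(self_mask, 0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

# Toy usage: 8 fake images from 3 subclasses, equally weighted losses.
disc = AuxDiscriminator()
adv, cls, feat = disc(torch.randn(8, 3, 32, 32))
labels = torch.randint(0, 3, (8,))
loss = (F.binary_cross_entropy_with_logits(adv, torch.ones_like(adv))
        + F.cross_entropy(cls, labels) + subclass_contrastive(feat, labels))
print(loss.item())
```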
https://arxiv.org/abs/2412.07196
Convolutional neural networks (CNNs) have been combined with generative adversarial networks (GANs) to create deep convolutional generative adversarial networks (DCGANs) with great success. DCGANs have been used to generate images and videos in creative domains such as fashion design and painting. A common critique of the use of DCGANs in creative applications is that they are limited in their ability to generate creative products, because the generator simply learns to copy the training distribution. We explore an extension of DCGANs, creative adversarial networks (CANs). Using CANs, we generate novel, creative portraits, using the WikiArt dataset to train the network. Moreover, we introduce our extension of CANs, conditional creative adversarial networks (CCANs), and demonstrate their potential to generate creative portraits conditioned on a style label. We argue that generating products conditioned on, or inspired by, a style label closely emulates real creative processes, in which humans produce imaginative work that is still rooted in previous styles.
https://arxiv.org/abs/2412.07091
How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns audio-visual modalities by enriching audio features with visual information and translating them into the visual latent space. These features are then fed into a pre-trained image generator to produce images. To enhance image quality, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. Our method achieves substantially better results on the VEGAS and VGGSound datasets than previous work and demonstrates control over the generation process through simple manipulations of the input waveform or latent space. Furthermore, we analyze the geometric properties of the learned embedding space and demonstrate that our learning approach effectively aligns audio-visual signals for cross-modal generation. Based on this analysis, we show that our method is agnostic to specific design choices, demonstrating its generalizability by integrating various model architectures and different types of audio-visual data.
https://arxiv.org/abs/2412.06209
Accurately generating images of human bodies from text remains a challenging problem for state-of-the-art text-to-image models. Commonly observed body-related artifacts include extra or missing limbs, unrealistic poses, blurred body parts, etc. Currently, evaluation of such artifacts relies heavily on time-consuming human judgments, limiting the ability to benchmark models at scale. We address this by proposing BodyMetric, a learnable metric that predicts body realism in images. BodyMetric is trained on realism labels and multi-modal signals, including 3D body representations inferred from the input image and textual descriptions. To facilitate this approach, we design an annotation pipeline to collect expert ratings on human body realism, leading to a new dataset for this task, namely BodyRealism. Ablation studies support our architectural choices for BodyMetric and the importance of leveraging a 3D human body prior for capturing body-related artifacts in 2D images. In comparison to concurrent metrics that evaluate general user preference in images, BodyMetric specifically reflects body-related artifacts. We demonstrate the utility of BodyMetric through applications that were previously infeasible at scale. In particular, we use BodyMetric to benchmark the ability of text-to-image models to produce realistic human bodies. We also demonstrate the effectiveness of BodyMetric in ranking generated images based on the predicted realism scores.
https://arxiv.org/abs/2412.04086
Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, the inference speed of diffusion models is limited by the slow sampling process, restricting its use cases. The sequential denoising steps required for generating a single sample can take tens or hundreds of iterations and have thus become a significant bottleneck. This limitation is more salient for applications that are interactive in nature or require low latency. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Exploiting the fact that the difference between images at adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute the patches of an image on multiple computing devices, each conditioned on the entire image (all patches) from the previous diffusion step. PCPP extends PP to reduce inference computation by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices. As a result, PCPP decreases the communication cost by around 70% compared to DistriFusion (the state-of-the-art implementation of PP) and achieves a 2.36-8.02x inference speed-up using 4-8 GPUs, compared to the 2.32-6.71x achieved by DistriFusion, depending on the computing-device configuration and generation resolution, at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.
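A single-process sketch of the partial-conditioning idea is shown below: the latent is split into strips, and each strip keeps only a thin halo of its neighbors, which is the only context that would need to be exchanged between devices. Real PCPP runs the patches on separate GPUs with asynchronous communication, which this illustration does not model; the strip layout and halo size are assumptions.

```python
import torch

def split_with_halo(latent: torch.Tensor, n_patches: int, halo: int):
    """Split a (C, H, W) latent into n_patches horizontal strips, each padded
    with a halo of rows from its neighbors. Only these halo rows would need to
    be communicated between devices in a PCPP-style setup."""
    _, h, _ = latent.shape
    step = h // n_patches
    patches = []
    for i in range(n_patches):
        top = max(i * step - halo, 0)
        bottom = min((i + 1) * step + halo, h)
        patches.append(latent[:, top:bottom, :])
    return patches

# Toy usage: a 64x64 latent split into 4 strips with a 4-row halo. Each device
# would denoise its strip conditioned only on this partial neighbor context.
latent = torch.randn(4, 64, 64)
parts = split_with_halo(latent, n_patches=4, halo=4)
print([p.shape[1] for p in parts])   # [20, 24, 24, 20]
```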
https://arxiv.org/abs/2412.02962
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a "position instruction token" before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences, a more challenging task than fixed-order generation, RandAR achieves performance comparable to its conventional raster-order counterpart. More importantly, decoder-only transformers trained on random orders acquire new capabilities. To address the efficiency bottleneck of AR models, RandAR adopts parallel decoding with a KV cache at inference time, enjoying a 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting, and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at this https URL.
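The sketch below shows how a randomly permuted training sequence with position instruction tokens might be assembled: each image token is preceded by a token encoding the spatial location to predict next. Offsetting position IDs past the image-token vocabulary is an illustrative encoding choice, not necessarily RandAR's.

```python
import torch

def build_randar_sequence(image_tokens: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Interleave a position-instruction token before each image token, in a
    random spatial order. image_tokens: (N,) token ids for N spatial positions.
    Position ids are offset by vocab_size so they occupy their own id range
    (an illustrative encoding choice)."""
    n = image_tokens.shape[0]
    order = torch.randperm(n)                       # random generation order
    pos_tokens = vocab_size + order                 # "predict this location next"
    seq = torch.stack([pos_tokens, image_tokens[order]], dim=1).reshape(-1)
    return seq                                      # [pos_0, img_0, pos_1, img_1, ...]

# Toy usage: a 4x4 grid of image tokens drawn from a 1024-token codebook.
tokens = torch.randint(0, 1024, (16,))
seq = build_randar_sequence(tokens, vocab_size=1024)
print(seq.shape)   # torch.Size([32])
```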
https://arxiv.org/abs/2412.01827
This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution that enables simultaneous dual control of fine expression and identity. Unlike conventional methods that use coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation for filtering out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer functions and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
https://arxiv.org/abs/2412.01254
We introduce Orthus, an autoregressive (AR) transformer that excels at generating images from textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved content. Unlike prior work on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation, while the fully AR formulation makes the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads: a regular language modeling (LM) head predicts discrete text tokens, and a diffusion head generates continuous image features conditioned on the output of the backbone. We devise an efficient strategy for building Orthus: by substituting the Vector Quantization (VQ) operation in an existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within merely 72 A100 GPU hours). Orthus-base can further undergo post-training to better model interleaved images and texts. Empirically, Orthus surpasses competing baselines, including Show-o and Chameleon, across standard benchmarks, achieving a GenEval score of 0.58 and an MME-P score of 1265.8 with 7B parameters. Orthus also shows exceptional mixed-modality generation capabilities, reflecting its potential for handling intricate practical generation tasks.
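The modality-specific heads can be pictured as in the sketch below: a shared backbone feeds a standard LM head for discrete text tokens and a small denoising ("diffusion") head that predicts continuous image features conditioned on the backbone output. The GRU backbone, layer sizes, and timestep handling are simplified placeholders, not Orthus's architecture.

```python
import torch
import torch.nn as nn

class TwoHeadBackbone(nn.Module):
    """Shared backbone with an LM head for discrete text tokens and a small
    denoising ("diffusion") head predicting continuous image features
    conditioned on the backbone output (simplified placeholder)."""
    def __init__(self, dim: int = 64, vocab: int = 1000, img_dim: int = 32):
        super().__init__()
        self.backbone = nn.GRU(dim, dim, batch_first=True)   # stand-in for the AR transformer
        self.lm_head = nn.Linear(dim, vocab)                  # discrete text logits
        self.diff_head = nn.Sequential(                       # continuous image features
            nn.Linear(dim + img_dim + 1, dim), nn.SiLU(), nn.Linear(dim, img_dim))

    def forward(self, hidden_in, noised_img_feat, t):
        h, _ = self.backbone(hidden_in)
        text_logits = self.lm_head(h)                          # LM head
        t_emb = t.view(1, 1, 1).expand(h.shape[0], h.shape[1], 1)
        denoised = self.diff_head(torch.cat([h, noised_img_feat, t_emb], dim=-1))
        return text_logits, denoised

# Toy usage: batch of 2, sequence length 5.
model = TwoHeadBackbone()
logits, feat = model(torch.randn(2, 5, 64), torch.randn(2, 5, 32), torch.tensor(0.3))
print(logits.shape, feat.shape)   # torch.Size([2, 5, 1000]) torch.Size([2, 5, 32])
```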
https://arxiv.org/abs/2412.00127