We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.
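The abstract does not spell out the alignment objective, but a contrastive formulation is one plausible reading: the joint ID-text representation is pulled toward matching face, text, and image embeddings with a symmetric InfoNCE term per space. The sketch below assumes that interpretation; the loss weights, temperature, and random stand-ins for the encoders are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings matched by index."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def alignment_loss(joint_emb, face_emb, text_emb, image_emb, w=(1.0, 1.0, 1.0)):
    """Align the joint ID+text representation with face, text, and image spaces."""
    return (w[0] * info_nce(joint_emb, face_emb)
            + w[1] * info_nce(joint_emb, text_emb)
            + w[2] * info_nce(joint_emb, image_emb))

# Toy usage with random stand-ins for the four encoders' outputs.
B, D = 8, 768
loss = alignment_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```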
https://arxiv.org/abs/2504.14202
In this work, we introduce MedIL, a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions for image generation. Medical images are often large and heterogeneous, and fine details are of vital clinical importance. Image properties change drastically when considering acquisition equipment, patient demographics, and pathology, making realistic medical image generation challenging. Recent work in latent diffusion models (LDMs) has shown success in generating images resampled to a fixed size. However, this covers only a narrow subset of the resolutions native to image acquisition, and resampling discards fine anatomical details. MedIL utilizes implicit neural representations to treat images as continuous signals, where encoding and decoding can be performed at arbitrary resolutions without prior resampling. We quantitatively and qualitatively show how MedIL compresses and preserves clinically relevant features over large multi-site, multi-resolution datasets of both T1w brain MRIs and lung CTs. We further demonstrate how MedIL can influence the quality of images generated with a diffusion model, and discuss how MedIL can enhance generative models so that their outputs better resemble raw clinical acquisitions.
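As a rough illustration of how implicit neural representations decouple decoding from a fixed grid, the sketch below queries a latent feature map with continuous coordinates of whatever output resolution is requested. The architecture (bilinear latent sampling plus a coordinate-conditioned MLP) is a generic INR decoder assumed for illustration, not MedIL's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class INRDecoder(nn.Module):
    """Decode a latent feature map at any target resolution by querying
    continuous (y, x) coordinates -- a sketch, not MedIL's exact architecture."""
    def __init__(self, latent_ch=64, hidden=128, out_ch=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_ch + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_ch),
        )

    def forward(self, latent, out_h, out_w):
        b = latent.size(0)
        # Normalized coordinate grid in [-1, 1] for the requested resolution.
        ys = torch.linspace(-1, 1, out_h)
        xs = torch.linspace(-1, 1, out_w)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)   # (H, W, 2)
        grid = grid.unsqueeze(0).expand(b, -1, -1, -1)
        # Bilinearly sample the latent at those coordinates (grid_sample expects (x, y)).
        feats = F.grid_sample(latent, grid.flip(-1), align_corners=True)    # (B, C, H, W)
        feats = feats.permute(0, 2, 3, 1)                                   # (B, H, W, C)
        out = self.mlp(torch.cat([feats, grid], dim=-1))                    # (B, H, W, out_ch)
        return out.permute(0, 3, 1, 2)

# The same latent decoded at two different resolutions, with no prior resampling.
dec = INRDecoder()
z = torch.randn(1, 64, 16, 16)
lo = dec(z, 128, 128)
hi = dec(z, 240, 180)   # non-square, non-power-of-two target size
```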
https://arxiv.org/abs/2504.09322
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: this https URL.
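One simple way to realize a motion intensity modulation module of the kind described is FiLM/AdaLN-style conditioning, where a scalar intensity is mapped to per-channel scale and shift applied to motion features. The sketch below follows that assumption; the module and tensor names are hypothetical, not the paper's.

```python
import torch
import torch.nn as nn

class MotionIntensityModulation(nn.Module):
    """Map a scalar motion-intensity control to per-channel scale/shift applied
    to motion features -- an illustrative sketch, not the paper's module."""
    def __init__(self, dim):
        super().__init__()
        self.to_scale_shift = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, 2 * dim)
        )

    def forward(self, motion_feats, intensity):
        # motion_feats: (B, T, D); intensity: (B, 1), e.g. in [0, 1]
        scale, shift = self.to_scale_shift(intensity).chunk(2, dim=-1)
        return motion_feats * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

mod = MotionIntensityModulation(dim=256)
feats = torch.randn(2, 16, 256)                    # 16 frames of motion features
calm = mod(feats, torch.tensor([[0.1], [0.1]]))    # subdued expression/body motion
lively = mod(feats, torch.tensor([[0.9], [0.9]]))  # exaggerated motion
```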
https://arxiv.org/abs/2504.04842
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves the state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
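The causal integration performed by the CTM-Adapter could, for instance, be realized as cross-attention from listener frame tokens to speaker audio-visual tokens under a causal mask, so each generated frame only sees speaker cues up to its own time step. The sketch below illustrates that mechanism under stated assumptions; it is not the paper's code.

```python
import torch
import torch.nn as nn

class CausalTemporalCrossAttention(nn.Module):
    """Listener frame tokens attend to speaker cue tokens, but only up to the
    current time step -- a sketch of causal multimodal conditioning."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, listener_tokens, speaker_tokens):
        # listener_tokens, speaker_tokens: (B, T, D) with aligned time axes.
        T = listener_tokens.size(1)
        # attn_mask[i, j] = True means frame i may NOT attend to speaker step j.
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(listener_tokens, speaker_tokens, speaker_tokens,
                           attn_mask=causal_mask)
        return listener_tokens + out   # residual injection into the video branch

layer = CausalTemporalCrossAttention()
listener = torch.randn(1, 24, 512)   # 24 latent video frames
speaker = torch.randn(1, 24, 512)    # fused audio + facial-motion cues, per frame
conditioned = layer(listener, speaker)
```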
https://arxiv.org/abs/2504.04010
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, which operate as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and a few hours of training, with no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built from real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
https://arxiv.org/abs/2503.21943
Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in the image-to-texture task, enabling semantically correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.
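The geometry-related CFG is not detailed in the abstract; a common way to guide on two conditions is to compose an unconditional pass, a geometry-only pass, and a geometry-plus-image pass with separate scales, as sketched below. The guidance scales and pass layout are assumptions for illustration, not RomanTex's formulation.

```python
import torch

def geometry_image_cfg(eps_uncond, eps_geo, eps_geo_img, s_geo=3.0, s_img=5.0):
    """Compose classifier-free guidance from a geometry-only pass and a
    geometry+image pass -- a sketch of multi-condition CFG with illustrative scales."""
    return (eps_uncond
            + s_geo * (eps_geo - eps_uncond)     # push toward the geometry condition
            + s_img * (eps_geo_img - eps_geo))   # then toward the reference image

# eps_* would come from three forward passes of the denoiser at the same step.
shape = (1, 4, 64, 64)
guided = geometry_image_cfg(torch.randn(shape), torch.randn(shape), torch.randn(shape))
```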
https://arxiv.org/abs/2503.19011
Suffering from performance bottlenecks in passively detecting high-quality Deepfake images due to the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, while a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various identity recognition models, outperforming state-of-the-art proactive perturbations in preventing face swapping models from generating images with correct source identities.
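Dynamic Loss Weighting could be instantiated in several ways; the sketch below assumes a softmax reweighting in which identity extractors whose source identity is still recoverable from the cloaked image receive larger weight. The extractor interfaces, temperature, and loss form are illustrative assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def dynamic_identity_loss(cloaked_ids, source_ids, temperature=1.0):
    """Adaptively balance per-extractor identity losses: extractors whose source
    identity is still easy to recover get larger weights. A sketch only."""
    losses = []
    for cloaked, src in zip(cloaked_ids, source_ids):
        # Protection objective: push the cloaked identity away from the source one.
        losses.append(F.cosine_similarity(cloaked, src, dim=-1).mean())
    losses = torch.stack(losses)                      # (num_extractors,)
    weights = F.softmax(losses / temperature, dim=0)  # larger loss -> larger weight
    return (weights.detach() * losses).sum()

# Toy usage with three hypothetical identity extractors producing 512-d embeddings.
cloaked = [torch.randn(4, 512, requires_grad=True) for _ in range(3)]
source = [torch.randn(4, 512) for _ in range(3)]
dynamic_identity_loss(cloaked, source).backward()
```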
https://arxiv.org/abs/2503.18678
In this paper, we address the problem of generative dataset distillation, which utilizes generative models to synthesize images. The generator may produce any number of images within a fixed evaluation-time budget. In this work, we leverage the popular diffusion model as the generator to compute a surrogate dataset, boosted by a min-max loss that controls the dataset's diversity and representativeness during training. However, the diffusion model is time-consuming when generating images, as it requires an iterative generation process. We observe a critical trade-off between the number of image samples and the image quality controlled by the diffusion steps, and propose Diffusion Step Reduction to achieve optimal performance. This paper details our comprehensive method and its performance. Our model achieved second place in the generative track of The First Dataset Distillation Challenge of ECCV 2024 (this https URL), demonstrating its superior performance.
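A small worked example of the sample-count versus diffusion-step trade-off under a fixed time budget, using made-up timings purely for illustration:

```python
# Illustrative budget arithmetic for Diffusion Step Reduction (numbers are made up).
time_per_step = 0.05          # assumed seconds per denoising step for one image
budget = 600.0                # fixed evaluation-time budget in seconds

for steps in (50, 25, 10, 5):
    images = int(budget / (steps * time_per_step))
    print(f"{steps:>2} steps/image -> {images:>5} surrogate images within the budget")
# Fewer steps lower per-image fidelity but allow many more distilled samples;
# the method searches for the step count that balances the two.
```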
https://arxiv.org/abs/2503.18626
Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning methods, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, consuming time and resources and posing a risk of instability. Adapter-based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which are not to be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.
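At its core, the idea is a hypernetwork that maps identity features to low-rank weight updates. The sketch below predicts LoRA factors A and B for a single linear layer and applies the adapted forward pass; the dimensions, rank, and names are illustrative assumptions rather than HyperLoRA's actual design.

```python
import torch
import torch.nn as nn

class LoRAHyperNetwork(nn.Module):
    """Predict low-rank LoRA factors (A, B) for one target linear layer from an
    identity embedding -- a sketch of adaptive LoRA-weight generation."""
    def __init__(self, id_dim=512, in_features=768, out_features=768, rank=4):
        super().__init__()
        self.rank, self.in_f, self.out_f = rank, in_features, out_features
        self.to_A = nn.Linear(id_dim, rank * in_features)
        self.to_B = nn.Linear(id_dim, out_features * rank)

    def forward(self, id_emb):
        A = self.to_A(id_emb).view(-1, self.rank, self.in_f)   # (B, r, in)
        B = self.to_B(id_emb).view(-1, self.out_f, self.rank)  # (B, out, r)
        return A, B

def lora_forward(x, base_weight, A, B, scale=1.0):
    """y = x W^T + scale * x A^T B^T, i.e. W is adapted by the predicted dW = B @ A."""
    base = x @ base_weight.t()
    delta = (x @ A.transpose(1, 2)) @ B.transpose(1, 2)
    return base + scale * delta

hyper = LoRAHyperNetwork()
id_emb = torch.randn(1, 512)        # face-identity embedding of the reference photo
A, B = hyper(id_emb)
x = torch.randn(1, 77, 768)         # token features entering the adapted layer
y = lora_forward(x, torch.randn(768, 768), A, B)
```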
https://arxiv.org/abs/2503.16944
Large-scale text-to-image (T2I) diffusion models have revolutionized image generation, enabling the synthesis of highly detailed visuals from textual descriptions. However, these models may inadvertently generate inappropriate content, such as copyrighted works or offensive images. While existing methods attempt to eliminate specific unwanted concepts, they often fail to ensure complete removal, allowing the concept to reappear in subtle forms. For instance, a model may successfully avoid generating images in Van Gogh's style when explicitly prompted with 'Van Gogh', yet still reproduce his signature artwork when given the prompt 'Starry Night'. In this paper, we propose SAFER, a novel and efficient approach for thoroughly removing target concepts from diffusion models. At a high level, SAFER is inspired by the observed low-dimensional structure of the text embedding space. The method first identifies a concept-specific subspace $S_c$ associated with the target concept c. It then projects the prompt embeddings onto the complementary subspace of $S_c$, effectively erasing the concept from the generated images. Since concepts can be abstract and difficult to fully capture using natural language alone, we employ textual inversion to learn an optimized embedding of the target concept from a reference image. This enables more precise subspace estimation and enhances removal performance. Furthermore, we introduce a subspace expansion strategy to ensure comprehensive and robust concept erasure. Extensive experiments demonstrate that SAFER consistently and effectively erases unwanted concepts from diffusion models while preserving generation quality.
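The projection step can be made concrete as follows: estimate an orthonormal basis V of the concept subspace $S_c$ from embeddings of the target concept, then map prompt embeddings through the complementary projector $I - VV^\top$. The sketch below follows that reading; the subspace dimension and the use of SVD to obtain the basis are assumptions.

```python
import torch

def concept_subspace_basis(concept_embs, k=4):
    """Estimate an orthonormal basis of the concept subspace S_c from embeddings of
    the target concept (e.g., prompt variants or a textual-inversion token).
    A sketch; k is an illustrative subspace dimension."""
    centered = concept_embs - concept_embs.mean(dim=0, keepdim=True)
    # Right singular vectors give the principal directions of the concept embeddings.
    _, _, Vh = torch.linalg.svd(centered, full_matrices=False)
    return Vh[:k].t()                    # (D, k), orthonormal columns

def erase_concept(prompt_embs, basis):
    """Project prompt embeddings onto the complement of S_c: e <- e - V V^T e."""
    return prompt_embs - (prompt_embs @ basis) @ basis.t()

D = 768
concept = torch.randn(32, D)             # embeddings describing e.g. a target style
V = concept_subspace_basis(concept)
prompt = torch.randn(77, D)              # token embeddings of the user prompt
cleaned = erase_concept(prompt, V)
# Components of the prompt lying in S_c are now numerically close to zero:
print((cleaned @ V).abs().max())
```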
https://arxiv.org/abs/2503.16835
Super-resolution is aimed at reconstructing high-resolution images from low-resolution observations. State-of-the-art approaches underpinned by deep learning allow for obtaining outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In this work, we address the latter problem, and we present our efforts toward learning super-resolution algorithms in a task-driven way to make them suitable for generating high-resolution images that can be exploited for automated image analysis. In this initial research, we propose a methodological approach for assessing whether existing models that perform computer vision tasks can be used to evaluate super-resolution reconstruction algorithms, as well as to train them in a task-driven way. We support our analysis with an experimental study, and we expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.
https://arxiv.org/abs/2503.15474
Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.
https://arxiv.org/abs/2503.13730
Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed-point limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency. The codes are available at this https URL.
https://arxiv.org/abs/2503.12963
In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks in one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with a teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential. Code is available at: this https URL.
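A sketch of what a unified next-token formulation could look like: the global caption, the planned layout (local captions plus quantized box coordinates), and the image tokens are serialized into one sequence, so layout planning and layout-to-image generation are just different prefixes of the same autoregressive task. The special tokens and coordinate binning below are illustrative assumptions, not PlanGen's vocabulary.

```python
# Illustrative unified sequence for a single autoregressive transformer:
# global caption -> planned layout (local captions + quantized boxes) -> image tokens.

def quantize_box(box, bins=1000):
    """Map normalized (x0, y0, x1, y1) to discrete coordinate tokens."""
    return [f"<coord_{int(v * (bins - 1))}>" for v in box]

def build_sequence(caption, layout, image_tokens):
    seq = ["<bos>", "<caption>"] + caption.split() + ["<layout>"]
    for local_caption, box in layout:
        seq += ["<obj>"] + local_caption.split() + quantize_box(box)
    seq += ["<image>"] + image_tokens + ["<eos>"]
    return seq

layout = [("a red mug", (0.10, 0.55, 0.35, 0.90)),
          ("a laptop", (0.40, 0.30, 0.95, 0.85))]
seq = build_sequence("a tidy desk near a window", layout,
                     [f"<img_{i}>" for i in range(4)])   # truncated image tokens
print(seq)
# Training with plain next-token prediction over such sequences lets the model stop
# after the layout tokens (layout planning), continue into image tokens
# (layout-to-image), or be prompted with an image to predict its layout.
```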
https://arxiv.org/abs/2503.10127
Recent Customized Portrait Generation (CPG) methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and personalized portrait generation, we develop a multi-modal image customizer capable of generating controlled fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into CPG. Extensive experiments demonstrate the superiority of Adv-CPG, e.g., the average attack success rate of the proposed Adv-CPG is 28.1% and 2.86% higher compared to the SOTA noise-based attack methods and unconstrained attack methods, respectively.
https://arxiv.org/abs/2503.08269
Diffusion models have become widely adopted in image completion tasks, with text prompts commonly employed to ensure semantic coherence by providing high-level guidance. However, a persistent challenge arises when an object is partially obscured in the damaged region, yet its remaining parts are still visible in the background. While text prompts offer semantic direction, they often fail to precisely recover fine-grained structural details, such as the object's overall posture, that align with the visible object information in the background. This limitation stems from the inability of text prompts to provide pixel-level specificity. To address this, we propose supplementing text-based guidance with a novel visual aid: a casual sketch, which can be roughly drawn by anyone based on visible object parts. This sketch supplies critical structural cues, enabling the generative model to produce an object structure that seamlessly integrates with the existing background. We introduce the Visual Sketch Self-Aware (VSSA) model, which integrates the casual sketch into each iterative step of the diffusion process, offering distinct advantages for partially corrupted scenarios. By blending sketch-derived features with those of the corrupted image, and leveraging text prompt guidance, the VSSA assists the diffusion model in generating images that preserve both the intended object semantics and structural consistency across the restored objects and original regions. To support this research, we created two datasets, CUB-sketch and MSCOCO-sketch, each combining images, sketches, and text. Extensive qualitative and quantitative experiments demonstrate that our approach outperforms several state-of-the-art methods.
https://arxiv.org/abs/2503.07047
Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.
https://arxiv.org/abs/2503.06884
Diffusion models have achieved remarkable advances in various image generation tasks. However, their performance notably declines when generating images at resolutions higher than those used during training. Although numerous methods exist for producing high-resolution images, they either suffer from inefficiency or are hindered by complex operations. In this paper, we propose RectifiedHR, an efficient and straightforward solution for training-free high-resolution image generation. Specifically, we introduce a noise refresh strategy, which theoretically requires only a few lines of code to unlock the model's high-resolution generation ability and improve efficiency. Additionally, we are the first to observe the phenomenon of energy decay, which may cause image blurriness during the high-resolution image generation process. To address this issue, we propose an Energy Rectification strategy, in which modifying the hyperparameters of classifier-free guidance effectively improves generation performance. Our method is entirely training-free and has a simple implementation logic. Through extensive comparisons with numerous baseline methods, our RectifiedHR demonstrates superior effectiveness and efficiency.
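The noise refresh strategy is described only at a high level; one way to read it is that, at a few selected steps of an otherwise standard DDIM-style loop, the current clean estimate is re-noised with freshly sampled Gaussian noise at the scheduled level. The sketch below encodes that interpretation with a toy schedule and a dummy denoiser; it is an assumption for illustration, not the paper's implementation.

```python
import torch

def sample_with_noise_refresh(denoiser, alpha_bars, x, refresh_steps=(35, 20)):
    """DDIM-style sampling with 'noise refresh': at selected steps the predicted
    clean image is re-noised with fresh Gaussian noise instead of reusing the
    predicted noise. Interface and schedule are illustrative only."""
    T = len(alpha_bars)
    for t in reversed(range(T)):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(x, t)                               # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean image
        if t in refresh_steps:
            # Noise refresh: resample the noise component at the next level.
            x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * torch.randn_like(x0)
        else:
            x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # deterministic step
    return x

# Toy run with a dummy denoiser at a resolution above a typical training size.
alpha_bars = torch.linspace(0.999, 0.02, 50)   # crude stand-in for a real schedule
out = sample_with_noise_refresh(lambda x, t: torch.randn_like(x), alpha_bars,
                                torch.randn(1, 4, 192, 192))
```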
https://arxiv.org/abs/2503.02537
In image generation, Schrödinger Bridge (SB)-based methods theoretically enhance efficiency and quality compared to diffusion models by finding the least costly path between two distributions. However, they are computationally expensive and time-consuming when applied to complex image data. The reason is that they focus on fitting globally optimal paths in high-dimensional spaces, directly generating images as the next step on the path using complex networks trained in a self-supervised manner, which typically results in a gap from the global optimum. Meanwhile, most diffusion models lie in the same path subspace generated by the weights $f_A(t)$ and $f_B(t)$, as they follow the paradigm $x_t = f_A(t)x_{\mathrm{Img}} + f_B(t)\epsilon$. To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrödinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. Specifically, our method optimizes the diffusion paths using a Kolmogorov-Arnold Network (KAN), which has the advantage of resistance to forgetting and continuous output. Experiments show that our LDSB significantly improves the quality and efficiency of image generation using the same pre-trained denoising network, and the KAN used for optimization is smaller than 0.1 MB. The FID metric is reduced by more than 15%, with a reduction of 48.50% when the NFE of DDIM is 5 on the CelebA dataset. Code is available at this https URL.
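The shared path subspace referred to above can be made explicit: every schedule of the form $x_t = f_A(t)x_{\mathrm{Img}} + f_B(t)\epsilon$ stays in the two-dimensional span of the image and the noise, and DDPM/DDIM schedules are particular choices of $(f_A, f_B)$. The snippet below only illustrates this parameterization; the KAN-based optimization of the local bridge is not reproduced.

```python
import torch

def forward_path(x_img, eps, f_A, f_B, t):
    """All schedules of the form x_t = f_A(t) * x_img + f_B(t) * eps live in the
    same two-dimensional path subspace spanned by (x_img, eps)."""
    return f_A(t) * x_img + f_B(t) * eps

# A DDPM/DDIM-style instance: f_A(t) = sqrt(alpha_bar_t), f_B(t) = sqrt(1 - alpha_bar_t).
alpha_bar = torch.linspace(0.999, 0.01, 1000)
f_A = lambda t: alpha_bar[t].sqrt()
f_B = lambda t: (1 - alpha_bar[t]).sqrt()

x_img, eps = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
x_500 = forward_path(x_img, eps, f_A, f_B, 500)
# LDSB, as described above, keeps this subspace but lets a small KAN reshape the
# weighting functions locally so the sampled path better bridges the two distributions.
```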
https://arxiv.org/abs/2502.19754
Text-to-Image models, including Stable Diffusion, have significantly improved in generating images that are highly semantically aligned with the given prompts. However, existing models may fail to produce appropriate images for cultural concepts or objects that are not well known or are underrepresented in Western cultures, such as 'hangari' (a Korean utensil). In this paper, we propose a novel approach, Culturally-Aware Text-to-Image Generation with Iterative Prompt Refinement (Culture-TRIP), which refines the prompt in order to improve the alignment of the image with such culture nouns in text-to-image models. Our approach (1) retrieves cultural contexts and visual details related to the culture nouns in the prompt and (2) iteratively refines and evaluates the prompt based on a set of cultural criteria and large language models. The refinement process utilizes information retrieved from Wikipedia and the Web. Our user survey, conducted with 66 participants from eight different countries, demonstrates that our proposed approach enhances the alignment between the images and the prompts. In particular, Culture-TRIP demonstrates improved alignment between the generated images and underrepresented culture nouns. Resources can be found at this https URL.
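The refinement loop can be pictured as retrieval followed by repeated LLM rewriting and scoring against cultural criteria until a threshold or iteration cap is reached. The sketch below uses placeholder callables for retrieval and the LLM; the function names, threshold, and stub outputs are illustrative, not the paper's pipeline.

```python
def refine_prompt(prompt, culture_noun, retrieve, llm_rewrite, llm_score,
                  threshold=0.8, max_iters=3):
    """Iterative culturally-aware prompt refinement -- a sketch of the loop described
    above. `retrieve`, `llm_rewrite`, and `llm_score` are placeholder callables
    (e.g., Wikipedia/Web search and an LLM), not the paper's API."""
    context = retrieve(culture_noun)          # cultural background + visual details
    best_prompt, best_score = prompt, 0.0
    for _ in range(max_iters):
        candidate = llm_rewrite(best_prompt, culture_noun, context)
        score = llm_score(candidate, culture_noun, context)  # cultural criteria in [0, 1]
        if score > best_score:
            best_prompt, best_score = candidate, score
        if best_score >= threshold:
            break
    return best_prompt

# Toy run with stub callables standing in for retrieval and LLM calls.
refined = refine_prompt(
    "a photo of hangari in a kitchen", "hangari",
    retrieve=lambda noun: "hangari: a traditional Korean earthenware storage jar",
    llm_rewrite=lambda p, n, c: p + ", a large dark-brown Korean earthenware jar",
    llm_score=lambda p, n, c: 0.9,
)
print(refined)
```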
https://arxiv.org/abs/2502.16902