Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This can limit the ability to generate images from complex prompts: for example, given the concept $\langle bo\rangle$, such methods struggle to generate "$\langle bo\rangle$ wearing its hat" without an additional textual description of the hat. We call this kind of generation personalized knowledge-driven generation. To address this limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting both personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation. This schedule enhances the mutual benefits between the two tasks. To quantitatively evaluate unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens performs competitively with leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can in turn yield valuable insights for understanding. Our code and dataset will be released at: this https URL.
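As a rough illustration of the three-stage schedule described above, the sketch below trains a shared bank of concept tokens and switches objectives per stage. All module names, losses, targets, and stage lengths are placeholders for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of unified concept tokens trained with a three-stage
# schedule (understanding warm-up -> generation -> joint). Losses are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedConceptTokens(nn.Module):
    """A small bank of learnable tokens shared by understanding and generation."""
    def __init__(self, num_tokens: int = 8, dim: int = 768):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        return self.tokens.unsqueeze(0).expand(batch_size, -1, -1)

# Stand-ins for features produced by the VLM's understanding / generation branches.
understanding_target = torch.randn(8, 768)
generation_target = torch.randn(8, 768)

concept = UnifiedConceptTokens()
opt = torch.optim.AdamW(concept.parameters(), lr=1e-3)

for step in range(300):
    toks = concept(batch_size=4)                                  # [4, 8, 768]
    l_und = F.mse_loss(toks, understanding_target.expand(4, -1, -1))
    l_gen = F.mse_loss(toks, generation_target.expand(4, -1, -1))
    if step < 100:          # stage 1: understanding warm-up
        loss = l_und
    elif step < 200:        # stage 2: bootstrap generation from understanding
        loss = l_gen
    else:                   # stage 3: deepen understanding from generation
        loss = l_und + l_gen
    opt.zero_grad(); loss.backward(); opt.step()
```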
https://arxiv.org/abs/2505.14671
Image generation models are a trending topic, with many people using artificial intelligence models to generate images. Given a text prompt, such models produce an image depicting that prompt. Many families of image generation models exist, including Latent Diffusion Models, Denoising Diffusion Probabilistic Models, and Generative Adversarial Networks. When generating images, these models can produce sensitive image data that threatens privacy or violates the copyrights of private entities. Machine unlearning aims to remove the influence of specific data subsets from trained models; for image generation models, the goal is to remove the influence of a concept so that the model can no longer generate images of that concept when prompted. Conventional retraining of the model can take up to days, so fast unlearning algorithms are needed. In this paper we propose an algorithm that removes the influence of concepts from diffusion models by updating the final layers of the text encoder via gradient-based optimization. Using a weighted loss function and backpropagation, we update the weights of the final layers of the text encoder component of the Stable Diffusion model, removing the influence of the concept from the text-image embedding space, so that when prompted, the result is an image that does not contain the concept. The weighted loss function makes use of Textual Inversion and Low-Rank Adaptation. We perform our experiments on Latent Diffusion Models, namely the Stable Diffusion v2 model, with an average concept-unlearning runtime of 50 seconds using 4-5 images.
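The abstract leaves the exact weighted loss unspecified; the sketch below is one plausible reading, assuming the concept prompt's text embedding is pulled toward a neutral anchor while an unrelated prompt is preserved, with only the last two transformer blocks of the Stable Diffusion 2.1 text encoder left trainable. The checkpoint name, anchor prompt, retained prompt, and loss weights are all assumptions.

```python
# Assumed sketch: unlearn a concept by fine-tuning only the last layers of the
# Stable Diffusion text encoder so the concept prompt maps close to a neutral
# anchor prompt. Anchor choice, weights, and step count are illustrative.
import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer

repo = "stabilityai/stable-diffusion-2-1"            # assumed checkpoint
tok = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
enc = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

# Freeze everything except the final two transformer blocks.
for p in enc.parameters():
    p.requires_grad_(False)
for layer in enc.text_model.encoder.layers[-2:]:
    for p in layer.parameters():
        p.requires_grad_(True)

def embed(prompts):
    ids = tok(prompts, padding="max_length", truncation=True,
              max_length=tok.model_max_length, return_tensors="pt").input_ids
    return enc(ids).last_hidden_state                # [B, 77, dim]

concept_prompts = ["a photo of a <concept>"]         # placeholder concept phrase
anchor_prompts  = ["a photo"]                        # neutral anchor
retain_prompts  = ["a photo of a dog"]               # unrelated prompt to preserve

with torch.no_grad():                                # fixed targets from the original encoder
    anchor = embed(anchor_prompts)
    retain_ref = embed(retain_prompts)

opt = torch.optim.AdamW([p for p in enc.parameters() if p.requires_grad], lr=1e-5)
for _ in range(50):
    erase_loss  = F.mse_loss(embed(concept_prompts), anchor)      # forget the concept
    retain_loss = F.mse_loss(embed(retain_prompts), retain_ref)   # keep other prompts intact
    loss = 1.0 * erase_loss + 0.5 * retain_loss                   # assumed weighting
    opt.zero_grad(); loss.backward(); opt.step()
```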
https://arxiv.org/abs/2505.12395
Stable Diffusion has advanced text-to-image synthesis, but training models to generate images with accurate object quantity remains difficult due to the high computational cost and the challenge of teaching models the abstract concept of quantity. In this paper, we propose CountDiffusion, a training-free framework for generating images with correct object quantity from textual descriptions. CountDiffusion consists of two stages. In the first stage, an intermediate denoising result generated by the diffusion model is used to predict the final synthesized image with one-step denoising, and a counting model counts the number of objects in this predicted image. In the second stage, a correction module corrects the object quantity by changing the attention map of the object with universal guidance. The proposed CountDiffusion can be plugged into any diffusion-based text-to-image (T2I) generation model without further training. Experimental results demonstrate the superiority of our proposed CountDiffusion, which improves the accurate object quantity generation ability of T2I models by a large margin.
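A minimal sketch of the two-stage loop follows, assuming the standard DDPM one-step estimate of the clean image; the counting model and the attention-map correction are stubbed out, and all numbers are placeholders.

```python
# Sketch: predict the final image from an intermediate step with one-step
# denoising, count objects, and only trigger a correction pass if the count is
# wrong. The closed-form x0 estimate is the standard DDPM relation.
import torch

def predict_x0(x_t: torch.Tensor, eps: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """One-step estimate of the clean sample x0 from a noisy latent x_t."""
    a = torch.tensor(alpha_bar_t)
    return (x_t - torch.sqrt(1.0 - a) * eps) / torch.sqrt(a)

def count_objects(image: torch.Tensor, target_class: str) -> int:
    # Placeholder for an off-the-shelf counting / detection model.
    return 2

def correct_with_attention_guidance(x_t, target_count, current_count):
    # Placeholder: the paper reweights the object's cross-attention map via
    # universal guidance; here the latent is returned unchanged.
    return x_t

x_t = torch.randn(1, 4, 64, 64)          # intermediate latent at step t
eps = torch.randn_like(x_t)              # noise predicted by the diffusion UNet
x0_hat = predict_x0(x_t, eps, alpha_bar_t=0.35)

target = 3                               # e.g. "three apples" in the prompt
if count_objects(x0_hat, "apple") != target:
    x_t = correct_with_attention_guidance(x_t, target, count_objects(x0_hat, "apple"))
# ...continue the normal denoising loop from the (possibly corrected) x_t
```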
https://arxiv.org/abs/2505.04347
This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.
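Since the architecture of the enhancement layer is not specified above, the following is only a minimal paired image-to-image sketch: a small residual convolutional head regressing baseline (FLUX.1-dev) outputs from distilled (FLUX.1-schnell) outputs with an L1 loss. Both the architecture and the loss are assumptions.

```python
# Illustrative sketch: train a small image-to-image head on synthetic pairs
# (distilled-model output -> baseline-model output).
import torch
import torch.nn as nn

class RefinerHead(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, x):
        return x + self.net(x)          # predict a residual correction

refiner = RefinerHead()
opt = torch.optim.AdamW(refiner.parameters(), lr=2e-4)

for step in range(100):
    # In practice: portraits from FLUX.1-schnell paired with FLUX.1-dev outputs.
    fast_out = torch.rand(8, 3, 256, 256)      # distilled model output (cheap)
    slow_out = torch.rand(8, 3, 256, 256)      # baseline model output (target)
    loss = nn.functional.l1_loss(refiner(fast_out), slow_out)
    opt.zero_grad(); loss.backward(); opt.step()
```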
https://arxiv.org/abs/2505.02255
Recent audio-visual generative models have made substantial progress in generating images from audio. However, existing approaches focus on generating images from single-class audio and fail to generate images from mixed audio. To address this, we propose an Audio-Visual Generation and Separation model (AV-GAS) for generating images from soundscapes (mixed audio containing multiple classes). Our contribution is threefold: First, we propose a new challenge in the audio-visual generation task, which is to generate an image given a multi-class audio input, and we propose a method that solves this task using an audio-visual separator. Second, we introduce a new audio-visual separation task, which involves generating separate images for each class present in a mixed audio input. Lastly, we propose new evaluation metrics for the audio-visual generation task: Class Representation Score (CRS) and a modified R@K. Our model is trained and evaluated on the VGGSound dataset. We show that our method outperforms the state-of-the-art, achieving 7% higher CRS and 4% higher R@2* in generating plausible images with mixed audio.
https://arxiv.org/abs/2504.18283
Generating images from prompts containing specific entities requires models to retain as much entity-specific knowledge as possible. However, fully memorizing such knowledge is impractical due to the vast number of entities and their continuous emergence. To address this, we propose Text-based Intelligent Generation with Entity prompt Refinement (TextTIGER), which augments knowledge on entities included in the prompts and then summarizes the augmented descriptions using Large Language Models (LLMs) to mitigate performance degradation from longer inputs. To evaluate our method, we introduce WiT-Cub (WiT with Captions and Uncomplicated Background-explanations), a dataset comprising captions, images, and an entity list. Experiments on four image generation models and five LLMs show that TextTIGER improves image generation performance in standard metrics (IS, FID, and CLIPScore) compared to caption-only prompts. Additionally, multiple annotators' evaluation confirms that the summarized descriptions are more informative, validating LLMs' ability to generate concise yet rich descriptions. These findings demonstrate that refining prompts with augmented and summarized entity-related descriptions enhances image generation capabilities. The code and dataset will be available upon acceptance.
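A hedged sketch of the augment-then-summarize prompt refinement follows; the prompt templates and the `llm` callable are placeholders rather than the authors' pipeline.

```python
# Sketch of the prompt-refinement flow: augment knowledge about each entity with
# an LLM, summarize it, and prepend the summary to the caption before T2I
# generation.
from typing import Callable, List

def refine_prompt(caption: str, entities: List[str], llm: Callable[[str], str]) -> str:
    summaries = []
    for entity in entities:
        augmented = llm(f"Describe the visual appearance of '{entity}' in detail.")
        summary = llm(
            "Summarize the following description in one short sentence, "
            f"keeping only visually relevant facts:\n{augmented}"
        )
        summaries.append(summary)
    # Keep the refined prompt short to limit degradation from long inputs.
    return ", ".join(summaries) + ". " + caption

if __name__ == "__main__":
    fake_llm = lambda p: "a red-brick clock tower with a tall spire"  # stub LLM
    print(refine_prompt("a postcard of Big Ben at dusk", ["Big Ben"], fake_llm))
```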
https://arxiv.org/abs/2504.18269
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed DICE-Talk, following the idea of disentangling identity from emotion and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
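The learnable Emotion Banks with vector quantization could look roughly like the sketch below, which uses a standard straight-through VQ lookup; the bank size, feature dimension, and commitment weight are assumptions.

```python
# Assumed sketch of a learnable "Emotion Bank": emotion features are quantized
# to the nearest bank entry, with a straight-through estimator keeping the
# lookup differentiable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionBank(nn.Module):
    def __init__(self, num_emotions: int = 16, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.bank = nn.Embedding(num_emotions, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: [B, dim] identity-agnostic emotion feature
        d = torch.cdist(z, self.bank.weight)          # [B, num_emotions]
        idx = d.argmin(dim=1)
        z_q = self.bank(idx)                          # quantized emotion code
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                  # straight-through gradient
        return z_q, idx, vq_loss

bank = EmotionBank()
feat = torch.randn(4, 256)
code, idx, loss = bank(feat)
print(code.shape, idx.tolist(), float(loss))
```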
https://arxiv.org/abs/2504.18087
We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.
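The alignment objective is not spelled out above; the sketch below assumes a symmetric InfoNCE-style contrastive loss that aligns the joint (face, text) representation with face, text, and image embedding spaces. All encoders are stand-ins.

```python
# Hedged sketch of a multi-modal alignment objective for a FaceCLIP-style model.
import torch
import torch.nn.functional as F

def contrastive(a: torch.Tensor, b: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

B, d = 8, 512
joint = torch.randn(B, d, requires_grad=True)   # stand-in for FaceCLIP(face, text)
face_emb, text_emb, image_emb = (torch.randn(B, d) for _ in range(3))

loss = (contrastive(joint, face_emb)
        + contrastive(joint, text_emb)
        + contrastive(joint, image_emb)) / 3.0
loss.backward()
```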
https://arxiv.org/abs/2504.14202
In this work, we introduce MedIL, a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions for image generation. Medical images are often large and heterogeneous, where fine details are of vital clinical importance. Image properties change drastically when considering acquisition equipment, patient demographics, and pathology, making realistic medical image generation challenging. Recent work in latent diffusion models (LDMs) has shown success in generating images resampled to a fixed-size. However, this is a narrow subset of the resolutions native to image acquisition, and resampling discards fine anatomical details. MedIL utilizes implicit neural representations to treat images as continuous signals, where encoding and decoding can be performed at arbitrary resolutions without prior resampling. We quantitatively and qualitatively show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets of both T1w brain MRIs and lung CTs. We further demonstrate how MedIL can influence the quality of images generated with a diffusion model, and discuss how MedIL can enhance generative models to resemble raw clinical acquisitions.
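One way to read "treating images as continuous signals" is a coordinate-conditioned decoder that can be queried at arbitrary resolutions; the sketch below is such a decoder with assumed Fourier features and layer sizes, not the MedIL architecture itself.

```python
# Illustrative coordinate-based decoder: given a latent code, intensities are
# queried at arbitrary (normalized) 3D coordinates, so decoding resolution is
# not tied to the training grid.
import torch
import torch.nn as nn

class CoordinateDecoder(nn.Module):
    def __init__(self, latent_dim: int = 128, hidden: int = 256, n_freq: int = 8):
        super().__init__()
        self.n_freq = n_freq
        in_dim = latent_dim + 3 * 2 * n_freq          # 3D coords -> Fourier features
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                     # predicted intensity
        )

    def forward(self, z: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # z: [latent_dim], coords: [N, 3] in [-1, 1]
        freqs = 2.0 ** torch.arange(self.n_freq) * torch.pi
        enc = torch.cat([torch.sin(coords[..., None] * freqs),
                         torch.cos(coords[..., None] * freqs)], dim=-1).flatten(1)
        z_rep = z.expand(coords.size(0), -1)
        return self.mlp(torch.cat([z_rep, enc], dim=-1))

dec = CoordinateDecoder()
latent = torch.randn(128)
coords = torch.rand(1000, 3) * 2 - 1                  # query any resolution you like
intensities = dec(latent, coords)                     # [1000, 1]
```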
https://arxiv.org/abs/2504.09322
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: this https URL.
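A minimal sketch of a facial-focused cross-attention block in the spirit described above: video tokens attend to features from the reference face region, standing in for a full reference network. Dimensions and the residual injection are assumptions.

```python
# Assumed sketch of facial-focused cross-attention for identity preservation.
import torch
import torch.nn as nn

class FaceCrossAttention(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, face_tokens):
        # video_tokens: [B, N, dim]; face_tokens: [B, M, dim] (cropped face features)
        out, _ = self.attn(query=video_tokens, key=face_tokens, value=face_tokens)
        return video_tokens + out          # residual injection of identity cues

block = FaceCrossAttention()
vid = torch.randn(2, 1024, 320)            # latent video tokens
face = torch.randn(2, 77, 320)             # reference face features
print(block(vid, face).shape)              # torch.Size([2, 1024, 320])
```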
https://arxiv.org/abs/2504.04842
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves the state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
https://arxiv.org/abs/2504.04010
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training; no costly real-world light-stage data is needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
https://arxiv.org/abs/2503.21943
Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in the image-to-texture task, enabling semantically-correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.
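The geometry-related CFG can be illustrated with a standard two-condition guidance combination; the composition order and guidance scales below are assumptions, not RomanTex's exact formulation.

```python
# Hedged sketch of a two-condition classifier-free guidance step (image and
# geometry conditions) composed into a single guided noise prediction.
import torch

def dual_cfg(eps_uncond, eps_img, eps_geo, s_img: float = 5.0, s_geo: float = 2.0):
    """Combine unconditional, image-conditioned, and geometry-conditioned
    noise predictions; scales are illustrative."""
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_geo * (eps_geo - eps_img))

eps_u, eps_i, eps_g = (torch.randn(1, 4, 32, 32) for _ in range(3))
guided = dual_cfg(eps_u, eps_i, eps_g)      # fed back into the sampler step
```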
https://arxiv.org/abs/2503.19011
As passive detection of high-quality Deepfake images suffers performance bottlenecks due to the advancement of generative models, proactive perturbations offer a promising approach to disabling Deepfake manipulations by inserting signals into benign images. However, existing proactive perturbation approaches remain unsatisfactory in several aspects: 1) visual degradation due to direct element-wise addition; 2) limited effectiveness against face swapping manipulation; 3) unavoidable reliance on white- and grey-box settings to involve generative models during training. In this study, we analyze the essence of Deepfake face swapping and argue the necessity of protecting source identities rather than target images, and we propose NullSwap, a novel proactive defense approach that cloaks source image identities and nullifies face swapping under a pure black-box scenario. We design an Identity Extraction module to obtain facial identity features from the source image, while a Perturbation Block is then devised to generate identity-guided perturbations accordingly. Meanwhile, a Feature Block extracts shallow-level image features, which are then fused with the perturbation in the Cloaking Block for image reconstruction. Furthermore, to ensure adaptability across different identity extractors in face swapping algorithms, we propose Dynamic Loss Weighting to adaptively balance identity losses. Experiments demonstrate the outstanding ability of our approach to fool various identity recognition models, outperforming state-of-the-art proactive perturbations in preventing face swapping models from generating images with correct source identities.
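Dynamic Loss Weighting is described above only at a high level; the sketch below assumes an inverse-magnitude rebalancing across identity extractors so that no single identity loss dominates. The momentum scheme and weighting rule are guesses for illustration.

```python
# Assumed sketch of dynamic weighting across several identity-extractor losses.
import torch

class DynamicLossWeighting:
    def __init__(self, num_losses: int, momentum: float = 0.9):
        self.avg = torch.ones(num_losses)    # running average of each loss magnitude
        self.momentum = momentum

    def __call__(self, losses):
        values = torch.stack([l.detach() for l in losses])
        self.avg = self.momentum * self.avg + (1 - self.momentum) * values
        weights = self.avg.sum() / (self.avg * len(losses))   # inverse-magnitude weights
        return sum(w * l for w, l in zip(weights, losses))

weighter = DynamicLossWeighting(num_losses=3)
id_losses = [torch.tensor(0.8, requires_grad=True),
             torch.tensor(0.1, requires_grad=True),
             torch.tensor(2.5, requires_grad=True)]   # losses from 3 identity extractors
total = weighter(id_losses)
total.backward()
```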
https://arxiv.org/abs/2503.18678
In this paper, we address the problem of generative dataset distillation, which uses generative models to synthesize images. The generator may produce any number of images within a fixed evaluation time. In this work, we leverage the popular diffusion model as the generator to compute a surrogate dataset, boosted by a min-max loss to control the dataset's diversity and representativeness during training. However, the diffusion model is time-consuming when generating images, as it requires an iterative generation process. We observe a critical trade-off between the number of image samples and the image quality controlled by the diffusion steps, and propose Diffusion Step Reduction to achieve optimal performance. This paper details our comprehensive method and its performance. Our model achieved second place in the generative track of The First Dataset Distillation Challenge of ECCV 2024 (this https URL), demonstrating its superior performance.
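The trade-off behind Diffusion Step Reduction can be made concrete with back-of-the-envelope arithmetic: under a fixed generation budget, fewer denoising steps per image buy more surrogate samples at lower per-image quality. The numbers below are made up for illustration.

```python
# Illustrative arithmetic for the step/sample trade-off under a time budget.
budget_s = 600.0            # total time allowed for synthesizing the surrogate set
time_per_step_s = 0.05      # cost of one denoising step for one image (assumed)

for steps in (50, 20, 10, 5):
    n_images = int(budget_s / (steps * time_per_step_s))
    print(f"{steps:>2} steps/image -> {n_images} surrogate images")
```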
https://arxiv.org/abs/2503.18626
Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning methods, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, consuming time and resources and posing a risk of instability. Adapter-based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which cannot be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.
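The core idea, a plug-in network that emits LoRA weights conditioned on an identity embedding, can be sketched as below; the layer shapes, rank, and the single-linear hyper-network are assumptions rather than the paper's architecture.

```python
# Minimal sketch: a hyper-network maps an identity embedding to LoRA weights
# (A, B) applied on top of a frozen base layer, so no per-person fine-tuning
# is needed.
import torch
import torch.nn as nn

class HyperLoRALinear(nn.Module):
    def __init__(self, in_f: int, out_f: int, id_dim: int = 512, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(in_f, out_f)
        for p in self.base.parameters():
            p.requires_grad_(False)                    # frozen foundation weights
        self.rank, self.in_f, self.out_f = rank, in_f, out_f
        self.hyper = nn.Linear(id_dim, rank * (in_f + out_f))

    def forward(self, x: torch.Tensor, id_emb: torch.Tensor) -> torch.Tensor:
        ab = self.hyper(id_emb)                        # [rank*(in_f+out_f)]
        A = ab[: self.rank * self.in_f].view(self.rank, self.in_f)
        B = ab[self.rank * self.in_f:].view(self.out_f, self.rank)
        delta = B @ A                                  # low-rank weight update
        return self.base(x) + x @ delta.t()

layer = HyperLoRALinear(in_f=64, out_f=64)
x = torch.randn(2, 64)
id_emb = torch.randn(512)                              # e.g. a face-recognition embedding
out = layer(x, id_emb)                                 # identity-conditioned output
```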
https://arxiv.org/abs/2503.16944
Large-scale text-to-image (T2I) diffusion models have revolutionized image generation, enabling the synthesis of highly detailed visuals from textual descriptions. However, these models may inadvertently generate inappropriate content, such as copyrighted works or offensive images. While existing methods attempt to eliminate specific unwanted concepts, they often fail to ensure complete removal, allowing the concept to reappear in subtle forms. For instance, a model may successfully avoid generating images in Van Gogh's style when explicitly prompted with 'Van Gogh', yet still reproduce his signature artwork when given the prompt 'Starry Night'. In this paper, we propose SAFER, a novel and efficient approach for thoroughly removing target concepts from diffusion models. At a high level, SAFER is inspired by the observed low-dimensional structure of the text embedding space. The method first identifies a concept-specific subspace $S_c$ associated with the target concept c. It then projects the prompt embeddings onto the complementary subspace of $S_c$, effectively erasing the concept from the generated images. Since concepts can be abstract and difficult to fully capture using natural language alone, we employ textual inversion to learn an optimized embedding of the target concept from a reference image. This enables more precise subspace estimation and enhances removal performance. Furthermore, we introduce a subspace expansion strategy to ensure comprehensive and robust concept erasure. Extensive experiments demonstrate that SAFER consistently and effectively erases unwanted concepts from diffusion models while preserving generation quality.
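The central operation, projecting prompt embeddings onto the complement of a concept subspace $S_c$, can be sketched directly; how the subspace is estimated below (SVD of paired embedding differences) is an assumption for illustration, and the embeddings are random stand-ins.

```python
# Sketch of the core SAFER-style projection step.
import torch

def concept_subspace(concept_emb: torch.Tensor, neutral_emb: torch.Tensor, k: int = 4):
    """Return an orthonormal basis (d x k) spanning the concept directions."""
    diffs = concept_emb - neutral_emb                  # [N, d] paired prompt embeddings
    # Top-k right singular vectors approximate the concept-specific subspace.
    _, _, Vh = torch.linalg.svd(diffs, full_matrices=False)
    return Vh[:k].t()                                  # [d, k]

def erase_concept(prompt_emb: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project prompt embeddings onto the complement of the concept subspace."""
    return prompt_emb - (prompt_emb @ basis) @ basis.t()

d, N = 768, 32
concept_emb = torch.randn(N, d)     # e.g. embeddings of "... in Van Gogh style"
neutral_emb = torch.randn(N, d)     # the same prompts without the concept
basis = concept_subspace(concept_emb, neutral_emb, k=4)

prompt = torch.randn(1, d)          # embedding of a new prompt, e.g. "Starry Night"
cleaned = erase_concept(prompt, basis)
print(float((cleaned @ basis).abs().max()))   # ~0: concept component removed
```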
https://arxiv.org/abs/2503.16835
Super-resolution is aimed at reconstructing high-resolution images from low-resolution observations. State-of-the-art approaches underpinned with deep learning allow for obtaining outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In the reported work, we address the latter problem, and we present our efforts toward learning super-resolution algorithms in a task-driven way to make them suitable for generating high-resolution images that can be exploited for automated image analysis. In the reported initial research, we propose a methodological approach for assessing the existing models that perform computer vision tasks in terms of whether they can be used for evaluating super-resolution reconstruction algorithms, as well as training them in a task-driven way. We support our analysis with experimental study and we expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.
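A minimal sketch of task-driven training follows: the super-resolution network is optimized with a reconstruction term plus a loss from a frozen downstream vision model evaluated on the super-resolved output. Both networks and the weighting below are toy stand-ins, not the paper's setup.

```python
# Toy sketch of a task-driven super-resolution objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

sr_net = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear"),
                       nn.Conv2d(3, 3, 3, padding=1))          # toy SR model
task_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(8, 10))                      # frozen classifier stub
for p in task_net.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(sr_net.parameters(), lr=1e-4)
lr_img = torch.rand(4, 3, 32, 32)
hr_img = torch.rand(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))

sr_img = sr_net(lr_img)
# Reconstruction term + task term computed on the super-resolved image.
loss = F.l1_loss(sr_img, hr_img) + 0.1 * F.cross_entropy(task_net(sr_img), labels)
opt.zero_grad(); loss.backward(); opt.step()
```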
https://arxiv.org/abs/2503.15474
Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.
https://arxiv.org/abs/2503.13730
Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed-points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency. The code is available at this https URL.
https://arxiv.org/abs/2503.12963