We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle using canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. But another form of entanglement arises from the significant noise in the LVLM's embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome this with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing resources to label large data or train large models.
我们考虑从大型视觉-语言模型中分离出三维信息的问题,并通过生成的三维肖像展示了这一过程。这使得我们可以用自由形式的文字控制外观属性(如年龄、发型和眼镜),并通过3D几何学控制面部表情和相机姿态。在这个设置下,假设我们使用一个预训练的大规模视觉-语言模型(LVLM;CLIP)从一个小的2D数据集中生成结果,并且该数据集没有额外配对标签,同时定义了一个预设的三维可变形模型(FLAME)。首先,我们通过将神经3D三平面表示规范到二维参考帧来实现分离。然而,另一种纠缠形式来自于LVLM嵌入空间中的大量噪声,这些噪声描述了无关特征。这种噪声会损害输出质量和多样性,但我们可以通过计算效率高的随机近似器进行雅可比正则化的方法克服这一问题。 与现有方法相比,我们的方法可以在生成的肖像中添加文本和3D控制,当改变任一控制时,肖像仍然保持一致性。总体而言,这种方法让创作者能够在其自己的2D面部数据上控制三维生成模型,并且无需为标注大量数据或训练大型模型投入资源。
https://arxiv.org/abs/2506.14015
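The abstract above does not specify the form of its Jacobian regularization. As an illustration only, one common stochastic approximator estimates the squared Frobenius norm of a Jacobian J as the average of ||Jv||^2 over random sign vectors v, with Jv approximated by a finite difference. The toy map `f`, the name `jacobian_penalty`, and all parameter values below are assumptions made for this sketch, not the paper's implementation.

```python
import random

def f(x):
    # Hypothetical toy linear map standing in for a generator.
    # Its Jacobian is the fixed matrix A, so the exact penalty is known.
    A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def jacobian_penalty(func, x, num_samples=1, eps=1e-3, rng=random):
    """Estimate ||J||_F^2 at x as the mean of ||J v||^2 over random sign vectors v."""
    fx = func(x)
    total = 0.0
    for _ in range(num_samples):
        # Random sign (Rademacher) probe: E[v v^T] = I, so E[||J v||^2] = ||J||_F^2.
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        x_shift = [xi + eps * vi for xi, vi in zip(x, v)]
        # Forward finite difference approximates the Jacobian-vector product J v.
        jv = [(a - b) / eps for a, b in zip(func(x_shift), fx)]
        total += sum(c * c for c in jv)
    return total / num_samples

rng = random.Random(0)
estimate = jacobian_penalty(f, [0.5, -1.0, 2.0], num_samples=20000, rng=rng)
```

For this toy map the exact value is 15, the sum of the squared entries of A, so the estimate should land close to that. The point of such an estimator is that each sample costs one extra forward pass rather than a full Jacobian.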
This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.
这项研究提出了一种新方法,旨在提高使用扩散模型生成图像的成本与质量比率。我们假设经过蒸馏(例如FLUX.1-schnell)和基准(例如FLUX.1-dev)模型之间的差异在特定领域内是稳定且可学习的,比如人像生成领域。为此,我们生成了一个合成配对数据集,并训练了一个快速图像到图像翻译头。使用两组低质量和高质量的人工合成图像,我们的模型被训练以优化一个蒸馏生成器(例如FLUX.1-schnell)的输出,使其质量接近于计算资源需求更高的基准模型如FLUX.1-dev。 研究结果表明,结合大尺寸生成模型的简化版本与增强层的管道能够提供类似于基线版本的逼真图像,但相比FLUX.1-dev可降低高达82%的计算成本。这项研究表明,在大规模图像生成涉及的人工智能解决方案中,存在提高效率的巨大潜力。
https://arxiv.org/abs/2505.02255
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed as DICE-Talk, following the idea of disentangling identity with emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
最近在 Talking Head Generation (THG) 方面的进展,通过扩散模型实现了令人印象深刻的唇部同步和视觉质量;然而,现有的方法在生成具有情感表达力的同时保持说话者身份方面仍存在困难。我们指出了当前情感面部生成中的三个关键限制:音频中内在的情感线索利用不足、情感表示中的身份泄露以及情感关联的孤立学习。 为了应对这些挑战,我们提出了一种新的框架,称为 DICE-Talk,该框架遵循将身份与情绪解耦然后合作具有相似特征的情绪的理念。首先,我们开发了一个解耦式情感嵌入器,通过跨模态注意力同时建模音频-视觉的情感线索,表示为无身份的高斯分布。其次,我们引入了一种增强关联的情感条件模块,并采用可学习的情感库,该模块通过向量量化和基于注意力的功能聚合明确捕捉了情感之间的关系。第三,我们设计了一个情绪判别目标,在扩散过程中通过潜在空间分类强制执行情感一致性。 在 MEAD 和 HDTF 数据集上的广泛实验表明,我们的方法优于现有最佳方法,在情感准确性方面表现出色的同时保持了竞争性的唇部同步性能。定性结果和用户研究进一步确认了我们方法生成的身份一致且具有丰富相关情感表情的能力,并能够自然适应未见过的身份。
https://arxiv.org/abs/2504.18087
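The abstract above mentions vector quantization against learnable Emotion Banks without giving details. As a generic illustration of the quantization step alone, the sketch below maps an embedding to its nearest entry in a small bank under squared L2 distance. The bank values, the embedding, and the function name `quantize` are hypothetical, and the attention-based aggregation described in the abstract is not modeled here.

```python
def quantize(embedding, bank):
    """Return (index, entry) of the bank vector closest to embedding in squared L2 distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    index = min(range(len(bank)), key=lambda i: sq_dist(embedding, bank[i]))
    return index, bank[index]

# Hypothetical 2-D bank with three entries.
bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
index, entry = quantize([0.2, 0.9], bank)
```

Here the embedding [0.2, 0.9] is closest to the second bank entry, so `index` is 1.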
We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.
我们提出了一种新颖的框架,用于通过多模态编码策略而非使用适配器向预训练模型注入身份特征来进行保持身份特性的生成。我们的方法将身份和文本视为统一的条件输入。为此,我们引入了FaceCLIP,这是一种多模态编码器,能够为身份和文本语义学习联合嵌入空间。给定一个参考人脸图像和一段文字提示,FaceCLIP可以产生一种同时包含身份信息和文本内容的统一表示形式,这种表示形式能条件化基础扩散模型以生成既符合身份又与文本相关的图像。此外,我们还提出了一种多模态对齐算法来训练FaceCLIP,该算法使用一种损失函数将其联合表示与人脸、文本及图像嵌入空间进行对齐。接着,我们将FaceCLIP与Stable Diffusion XL (SDXL)集成起来构建了FaceCLIP-SDXL,这是一种保持身份特性的图像合成流水线。相比之前的方法,FaceCLIP-SDXL能够生成更逼真的肖像图片,并且在身份保存和文本相关性方面表现更好。广泛的实验表明其具有定量和定性的优越性能。
https://arxiv.org/abs/2504.14202
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: this https URL.
从单张静态肖像创建一个逼真的可动画化身仍然具有挑战性。现有方法往往难以捕捉微妙的面部表情、相关的全身动作以及动态背景。为了解决这些限制,我们提出了一种新颖框架,利用预训练的视频扩散变换器模型生成高保真度、连贯的说话头像,并且可以控制运动动力学。我们的工作核心是一种双阶段音频-视觉对齐策略。 在第一阶段,我们采用片段级训练方案,通过在整个场景中(包括参考肖像、上下文对象和背景)对准由音频驱动的动力学来建立连贯的整体运动。在第二阶段,我们使用唇部跟踪掩码以帧为单位细化嘴唇动作,确保与音频信号的精确同步。 为了保持身份一致性而不牺牲运动灵活性,我们将常用的参考网络替换为面部聚焦的跨注意力模块,该模块在整个视频中有效维持面部一致性。此外,我们整合了一个运动强度调节模块,它明确控制表情和身体运动强度,从而实现头像动作(不仅仅是唇部动作)可操控地调整。 广泛的实验结果表明,我们的方法在质量和现实感、连贯性、运动强度和身份保持方面均优于现有技术。 项目页面:this https URL。
https://arxiv.org/abs/2504.04842
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves the state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
生成自然且细腻的听者动作,以支持长时间互动的问题仍然未得到解决。现有方法通常依赖于低维运动代码来生成面部行为,并随后进行逼真的渲染,这既限制了视觉保真度也削弱了表现力的丰富性。为了应对这些挑战,我们引入了DiTaiListener,它是由多模态条件下的视频扩散模型驱动的。我们的方法首先使用DiTaiListener-Gen根据说话人的语音和面部动作生成听众反应的短片段。然后通过DiTaiListener-Edit改进过渡帧以实现无缝连接。 具体来说,DiTaiListener-Gen采用了一种经过改编的Diffusion Transformer(DiT)用于听者头像生成任务,并引入了一个因果时间多模态适配器(CTM-Adapter),用以处理说话人的音频和视觉线索。CTM-Adapter将说话人输入以因果方式整合到视频生成过程中,确保了在产生连贯且一致的听众反应时的时间连续性。 对于长时间视频生成,我们引入了DiTaiListener-Edit,这是一个用于过渡细化的视频到视频扩散模型。该模型融合短片段视频以生成流畅且连贯的长视频,确保在将由DiTaiListener-Gen产生的短视频片段合并后,在面部表情和图像质量方面的时间一致性。 从量化指标来看,DiTaiListener在基准数据集上的表现达到了最先进的水平,分别在逼真度(RealTalk数据集上FID得分提升73.8%)和运动表示能力(VICO数据集上FD指标提高6.1%)。用户研究证实了DiTaiListener的优越性,模型在反馈、多样性和流畅性方面明显优于竞争对手。
https://arxiv.org/abs/2504.04010
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training, with no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
文本到图像的扩散模型在生成多样的肖像方面表现出色,但缺乏直观的阴影控制。现有的编辑方法作为后处理手段,在处理不同风格时难以提供有效的操作。此外,这些方法要么依赖于昂贵的真实世界光舞台数据收集,要么需要大量的计算资源进行训练。为了解决这些问题,我们介绍了Shadow Director方法,该方法可以从已经训练好的扩散模型中提取并操纵隐藏的阴影属性。我们的方法使用一个小型估计网络,只需要几千张合成图像和几个小时的训练时间——无需昂贵的真实世界光舞台数据。 Shadow Director在生成肖像时提供了参数化且直观的阴影形状、位置及强度控制,并能在保持艺术完整性和身份一致性的前提下应用于各种风格中。尽管仅基于真实世界的身份构建并经过少量合成数据训练,它仍然能够有效地推广到具有多样风格的生成肖像上,使其成为一个更易于使用和资源友好的解决方案。
https://arxiv.org/abs/2503.21943
Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning based methods, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, consuming time and resources and posing an unstable risk. Adapter based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which are not to be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.
个人肖像合成技术在社交娱乐等领域中至关重要,最近取得了显著进展。基于个人微调的方法(如LoRA和DreamBooth)可以生成逼真的图像输出,但需要对每个样本进行训练,这会消耗大量时间和资源,并且存在不稳定的隐患。而基于适配器的技术(例如IP-Adapter),冻结基础模型参数并采用插件架构以实现零样本推理,但在肖像合成任务中往往缺乏自然感和真实性。 在本文中,我们提出了一种参数高效的自适应生成方法——HyperLoRA,该方法使用一个自适应的插件网络来生成LoRA权重,从而结合了LoRA的优越性能与适配器方案的零样本推理能力。通过精心设计的网络结构和训练策略,我们的方法能够实现高逼真度、保真度及可编辑性的零样本个性化肖像生成(支持单图或多图输入)。
https://arxiv.org/abs/2503.16944
Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoint with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency. Our codes are available at this https URL.
音频驱动的单张图像说话肖像生成在虚拟现实、数字人创建和电影制作中扮演着重要角色。现有的方法通常被分为基于关键点的方法和基于图像的方法。基于关键点的方法有效地保留了人物的身份,但因3D可变形模型(3D Morphable Model)固定点限制而难以捕捉到精细的面部细节。此外,传统的生成网络在使用有限数据集建立音频与关键点之间的因果关系时面临挑战,导致姿势多样性较低。相比之下,基于图像的方法通过扩散网络能够产生具有丰富细节和高质量的人物肖像,但会导致身份失真,并且计算成本高昂。 在这项工作中,我们提出了KDTalker框架,这是第一个结合无监督隐式3D关键点与时空扩散模型的框架。利用无监督隐式的3D关键点,KDTalker可以根据面部信息密度调整面部信息,使得扩散过程能够灵活地模拟多样的头部姿势,并捕捉到细微的面部细节。此外,自定义设计的时空注意力机制确保了准确的唇部同步,在提高计算效率的同时产生了时间一致性高、高质量的动画。 实验结果表明,KDTalker在唇部同步准确性、头部姿态多样性以及执行效率方面均达到了最先进的水平。代码可在提供的链接中获取。
https://arxiv.org/abs/2503.12963
Recent Customized Portrait Generation (CPG) methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and personalized portrait generation, we develop a multi-modal image customizer capable of generating controlled fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into CPG. Extensive experiments demonstrate the superiority of Adv-CPG, e.g., the average attack success rate of the proposed Adv-CPG is 28.1% and 2.86% higher compared to the SOTA noise-based attack methods and unconstrained attack methods, respectively.
最近定制化肖像生成(CPG)方法,通过面部图像和文本提示作为输入,引起了广泛关注。尽管这些方法能够生成高保真的面部画像,但它们无法防止生成的画像被恶意人脸识别系统追踪并滥用。为了解决这个问题,本文提出了一种结合了面部对抗攻击的定制化肖像生成框架(Adv-CPG)。为了实现面部隐私保护,我们设计了一个轻量级的局部身份加密器和一个加密增强器。二者分别通过直接注入目标身份信息和添加额外的身份指导,来实施渐进式的双层加密保护。 此外,为了完成精细且个性化的肖像生成,我们开发了一种多模态图像定制器,能够生成受控的细粒度面部特征。据我们所知,Adv-CPG是第一个将面部对抗攻击引入到CPG中的研究工作。广泛的实验展示了Adv-CPG的优势,例如,Adv-CPG的平均攻击成功率比最先进的基于噪声的攻击方法和无约束攻击方法分别高出28.1%和2.86%。
https://arxiv.org/abs/2503.08269
Text-to-image generative models have shown remarkable progress in producing diverse and photorealistic outputs. In this paper, we present a comprehensive analysis of their effectiveness in creating synthetic portraits that accurately represent various demographic attributes, with a special focus on age, nationality, and gender. Our evaluation employs prompts specifying detailed profiles (e.g., Photorealistic selfie photo of a 32-year-old Canadian male), covering a broad spectrum of 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. We compare the generated images against ground truth age estimates from two established age estimation models to assess how faithfully age is depicted. Our findings reveal that although text-to-image models can consistently generate faces reflecting different identities, the accuracy with which they capture specific ages and do so across diverse demographic backgrounds remains highly variable. These results suggest that current synthetic data may be insufficiently reliable for high-stakes age-related tasks requiring robust precision, unless practitioners are prepared to invest in significant filtering and curation. Nevertheless, they may still be useful in less sensitive or exploratory applications, where absolute age precision is not critical.
文本到图像的生成模型在产生多样性和照片般逼真的输出方面取得了显著进展。在这篇论文中,我们对这些模型在创建能够准确反映各种人口统计特征(特别是年龄、国籍和性别)的人工肖像方面的有效性进行了全面分析。我们的评估使用了指定详细个人资料的提示语(例如,“一个32岁的加拿大男性的现实自拍照”),涵盖了212个不同国家,从10岁到78岁的30种不同的年龄段,并且性别比例均衡。我们通过与两个已建立的年龄估计模型提供的真实年龄估计进行比较来评估生成图像中年龄描绘的真实程度。 我们的研究发现表明,尽管文本到图像的模型能够持续生成反映不同身份特征的脸部图像,但它们捕捉特定年龄以及在多样化的人口统计背景下的准确性仍然存在很大的变异性。这些结果暗示当前合成数据可能不足以用于需要高度精确度的关键性年龄相关任务,除非从业者愿意投入大量精力进行过滤和筛选工作。然而,在不敏感或探索性的应用中,即使绝对的年龄精度不是关键因素,它们仍可能具有一定的实用性。
https://arxiv.org/abs/2502.03420
Existing diffusion models show great potential for identity-preserving generation. However, personalized portrait generation remains challenging due to the diversity in user profiles, including variations in appearance and lighting conditions. To address these challenges, we propose IC-Portrait, a novel framework designed to accurately encode individual identities for personalized portrait generation. Our key insight is that pre-trained diffusion models are fast learners (e.g., 100 to 200 steps) for in-context dense correspondence matching, which motivates the two major designs of our IC-Portrait framework. Specifically, we reformulate portrait generation into two sub-tasks: 1) Lighting-Aware Stitching: we find that masking a high proportion of the input image, e.g., 80%, yields a highly effective self-supervisory representation learning of reference image lighting. 2) View-Consistent Adaptation: we leverage a synthetic view-consistent profile dataset to learn the in-context correspondence. The reference profile can then be warped into arbitrary poses for strong spatial-aligned view conditioning. Coupling these two designs by simply concatenating latents to form ControlNet-like supervision and modeling enables us to significantly enhance identity preservation fidelity and stability. Extensive evaluations demonstrate that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively, with particularly notable improvements in visual quality. Furthermore, IC-Portrait even demonstrates 3D-aware relighting capabilities.
现有的扩散模型在保持身份特性的生成方面展现出了巨大的潜力。然而,由于用户资料的多样性(包括外观和光照条件的变化),个性化肖像生成仍然面临挑战。为了解决这些问题,我们提出了IC-Portrait,这是一个全新的框架,旨在准确编码个人身份以进行个性化的肖像生成。我们的关键见解是,预训练的扩散模型在上下文中密集对应匹配方面学习迅速(例如100到200步),这激励了我们IC-Portrait框架的主要设计。 具体而言,我们将肖像生成重新表述为两个子任务: 1. **光照感知拼接**:我们发现对输入图像进行高度遮挡处理(例如80%),可以非常有效地学习参考图的自监督表示,从而更好地理解光照条件。 2. **视图一致性适应**:利用合成的视图一致资料集来学习上下文中的对应关系。这使得参考轮廓能够被变形到任意姿势,从而提供强大的空间对齐视图调节。 通过简单地连接潜在变量形成类似ControlNet的监督和建模方式,将这两种设计结合起来,我们显著增强了身份保持的准确度和稳定性。广泛的评估显示,IC-Portrait在定量和定性评价中均优于现有的最先进的方法,并且在视觉质量方面有了特别明显的改进。此外,IC-Portrait还展示了3D感知重光照的能力。
https://arxiv.org/abs/2501.17159
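The abstract above reports that masking a high proportion of the input image (e.g., 80%) is effective, but does not say how the masked regions are chosen. As one possible reading, the sketch below masks a uniformly random 80% of patches on a grid. The patch-grid layout, the uniform sampling, and the name `random_patch_mask` are assumptions for illustration only.

```python
import random

def random_patch_mask(grid_h, grid_w, mask_ratio=0.8, seed=0):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch."""
    n = grid_h * grid_w
    n_masked = round(mask_ratio * n)
    rng = random.Random(seed)
    # Sample patch indices without replacement so exactly n_masked patches are hidden.
    masked = set(rng.sample(range(n), n_masked))
    return [[(r * grid_w + c) in masked for c in range(grid_w)] for r in range(grid_h)]

# Hypothetical 8 x 8 patch grid: 64 patches, of which round(0.8 * 64) = 51 are masked.
mask = random_patch_mask(8, 8, mask_ratio=0.8)
masked_count = sum(cell for row in mask for cell in row)
```

Sampling without replacement keeps the masked fraction fixed per image, which is one simple way to hold the ratio at the stated 80%.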
This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike the conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. As its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation to filter out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
https://arxiv.org/abs/2412.01254
Recent diffusion-based single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models, frequently yielding excessively blurred textures. We attribute this issue to the insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy integrated within the optimization process leveraging cross-view priors to enhance representation consistency. Extensive experiments demonstrate that our method can produce 3D portraits with accurate geometry and rich details from a single image. The project page is at this https URL.
近期基于扩散的单图像三维肖像生成方法通常采用二维扩散模型来提供多视角知识,然后将这些信息提炼成三维表示。然而,这些方法常常难以生成高保真的三维模型,经常导致纹理过度模糊。我们把这个问题归因于在扩散过程中对跨视图一致性考虑不足,这导致不同视图之间存在显著差异,并最终使得三维表示变得模糊。在这篇论文中,我们通过全面利用条件设置和扩散过程中的多视角先验来解决这一问题,从而生成一致且细节丰富的肖像。从条件设定的角度来看,我们提出了一种混合先验扩散模型,该模型显式和隐式地将多视图先验作为条件,以增强生成的多视图肖像的状态一致性。从扩散角度看,考虑到扩散噪声分布对详细纹理生成的重大影响,我们提出了一个结合跨视角先验并整合在优化过程中的多视角噪声重采样策略,用以增强表示的一致性。大量的实验表明,我们的方法可以从单张图像中生成具有准确几何结构和丰富细节的三维肖像。该项目页面位于 \url{this https URL}。
https://arxiv.org/abs/2411.10369
We propose MegaPortrait, an innovative system for creating personalized portrait images in computer vision. It has three modules: Identity Net, Shading Net, and Harmonization Net. Identity Net generates learned identity using a customized model fine-tuned with source images. Shading Net re-renders portraits using extracted representations. Harmonization Net fuses pasted faces and the reference image's body for coherent results. Our approach with off-the-shelf ControlNets is better than state-of-the-art AI portrait products in identity preservation and image fidelity. MegaPortrait has a simple but effective design, and we compare it with other methods and products to show its superiority.
我们提出了一种名为MegaPortrait的系统,这是一种用于计算机视觉中创建个性化肖像图像的创新性系统。该系统包括三个模块:Identity Net(身份网络)、Shading Net(阴影网络)和Harmonization Net(和谐网络)。Identity Net利用经过源图像微调的定制模型生成学习到的身份特征。Shading Net使用提取的表示重新渲染肖像。Harmonization Net融合粘贴的脸部与参考图像的身体部分,以获得一致的结果。我们的方法结合现成的Controlnets,在身份保持和图像保真度方面优于现有的AI肖像产品。MegaPortrait的设计简单而有效,并且我们将其与其他方法和产品进行了比较,证明了其优越性。
https://arxiv.org/abs/2411.04357
In the field of human-centric personalized image generation, the adapter-based method obtains the ability to customize and generate portraits by text-to-image training on facial data. This allows for identity-preserved personalization without additional fine-tuning in inference. Although there are improvements in efficiency and fidelity, there is often a significant performance decrease in text-following ability, controllability, and diversity of generated faces compared to the base model. In this paper, we attribute the performance degradation to the failure to decouple identity features from other attributes during extraction, as well as the failure to decouple the portrait generation training from the overall generation task. To address these issues, we propose the Face Adapter with deCoupled Training (FACT) framework, focusing on both model architecture and training strategy. To decouple identity features from others, we leverage a transformer-based face-export encoder and harness fine-grained identity features. To decouple the portrait generation training, we propose Face Adapting Increment Regularization (FAIR), which effectively constrains the effect of face adapters on the facial region, preserving the generative ability of the base model. Additionally, we incorporate a face condition drop and shuffle mechanism, combined with curriculum learning, to enhance facial controllability and diversity. As a result, FACT solely learns identity preservation from training data, thereby minimizing the impact on the original text-to-image capabilities of the base model. Extensive experiments show that FACT has both controllability and fidelity in both text-to-image generation and inpainting solutions for portrait generation.
https://arxiv.org/abs/2410.12312
Portrait Fidelity Generation is a prominent research area in generative models, with a primary focus on enhancing both controllability and fidelity. Current methods face challenges in generating high-fidelity portrait results when faces occupy a small portion of the image with a low resolution, especially in multi-person group photo settings. To tackle these issues, we propose a systematic solution called MagicID, based on a self-constructed million-level multi-modal dataset named IDZoom. MagicID consists of Multi-Mode Fusion training strategy (MMF) and DDIM Inversion based ID Restoration inference framework (DIIR). During training, MMF iteratively uses the skeleton and landmark modalities from IDZoom as conditional guidance. By introducing the Clone Face Tuning in training stage and Mask Guided Multi-ID Cross Attention (MGMICA) in inference stage, explicit constraints on face positional features are achieved for multi-ID group photo generation. The DIIR aims to address the issue of artifacts. The DDIM Inversion is used in conjunction with face landmarks, global and local face features to achieve face restoration while keeping the background unchanged. Additionally, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method. To validate the effectiveness of MagicID, we conducted extensive comparative and ablation experiments. The experimental results demonstrate that MagicID has significant advantages in both subjective and objective metrics, and achieves controllable generation in multi-person scenarios.
肖像保真生成是生成模型中的一个重要研究方向,主要关注提升可控性和保真度。当人脸在图像中占比较小且分辨率较低时,尤其是在多人合影场景中,现有方法难以生成高保真的肖像结果。为解决这些问题,我们提出了一个系统性解决方案MagicID,它基于自建的百万级多模态数据集IDZoom。MagicID由多模态融合训练策略(MMF)和基于DDIM Inversion的ID修复推理框架(DIIR)组成。在训练过程中,MMF迭代地使用IDZoom中的骨架和关键点模态作为条件引导。通过在训练阶段引入Clone Face Tuning、在推理阶段引入掩码引导的多ID交叉注意力(MGMICA),实现了对人脸位置特征的显式约束,用于多ID合影生成。 DIIR旨在解决伪影问题。将DDIM Inversion与面部关键点、全局和局部面部特征相结合,可以在保持背景不变的情况下实现面部修复。此外,DIIR即插即用,可以应用于任何基于扩散的肖像生成方法。为了验证MagicID的有效性,我们进行了广泛的对比和消融实验。实验结果表明,MagicID在主观和客观指标上均具有显著优势,并在多人场景中实现了可控生成。
https://arxiv.org/abs/2408.09248
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves state-of-the-art performance in consistent character generation, with an improvement of 24.8% compared with other methods.
https://arxiv.org/abs/2406.16537
Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driven mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.
https://arxiv.org/abs/2406.10553
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive ID-preservation strategy that fully considers both intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that our ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.
https://arxiv.org/abs/2404.16771