Diffusion-based technologies have recently made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images of the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks encode the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with the landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.
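As a rough illustration of the kind of fusion HiFi-Net performs (module names, shapes, and the cross-attention design below are assumptions, not the authors' code), one can condition landmark tokens on stacked per-reference face features:

```python
# Hypothetical sketch (not the authors' code): fuse features from several
# reference faces of one ID and align them with landmark tokens via
# cross-attention, one plausible reading of the "HiFi-Net" description.
import torch
import torch.nn as nn

class MultiFaceFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Queries come from landmark tokens, keys/values from the stacked
        # per-reference face features, so the output is landmark-aligned.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, landmark_tokens, face_features):
        # landmark_tokens: (B, L, dim)    3D-aware landmark embedding
        # face_features:   (B, N*T, dim)  tokens from N reference faces
        fused, _ = self.attn(landmark_tokens, face_features, face_features)
        return self.norm(landmark_tokens + fused)   # residual fusion

# Toy usage with random tensors.
fusion = MultiFaceFusion()
lm = torch.randn(2, 68, 768)          # e.g. 68 landmark tokens
faces = torch.randn(2, 4 * 77, 768)   # 4 reference images, 77 tokens each
out = fusion(lm, faces)               # (2, 68, 768) identity-conditioned tokens
```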
https://arxiv.org/abs/2512.14542
Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and balances personalized person generation with semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at this https URL.
https://arxiv.org/abs/2511.16712
Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
https://arxiv.org/abs/2510.26819
We introduce TurboPortrait3D: a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack detail, and tend to fail to fully preserve the identity of the subject. On the other hand, image diffusion models excel at generating high-quality images, but besides being computationally expensive, they are not grounded in 3D and thus are not directly capable of producing multi-view-consistent outputs. In this work, we demonstrate that image-space diffusion models can be used to significantly enhance the quality of existing image-to-avatar methods, while maintaining 3D awareness and running with low latency. Our method takes a single frontal image of a subject as input and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model which is conditioned on the input image(s) and is specifically trained to refine the renders in a multi-view-consistent way. Moreover, we introduce a novel and effective training strategy that includes pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach both qualitatively and quantitatively outperforms the current state of the art for portrait novel-view synthesis, while remaining time-efficient.
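The single-step refinement stage could look roughly like the following sketch, where the refiner, its architecture, and the residual formulation are illustrative assumptions rather than the released pipeline:

```python
# Minimal sketch (assumed, not the released pipeline): a one-step refiner that
# takes noisy avatar renders plus the input portrait as conditioning and
# predicts cleaned, view-consistent renders in a single forward pass.
import torch
import torch.nn as nn

class OneStepRefiner(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # 3 channels for the noisy render + 3 for the conditioning image.
        self.net = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, noisy_render, cond_image):
        x = torch.cat([noisy_render, cond_image], dim=1)
        return noisy_render + self.net(x)   # predict a residual correction

refiner = OneStepRefiner()
noisy = torch.randn(4, 3, 256, 256)                         # 4 novel views rendered from the avatar
cond = torch.randn(1, 3, 256, 256).expand(4, -1, -1, -1)    # frontal input, repeated per view
clean = refiner(noisy, cond)                                # refined renders, one step, no sampling loop
```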
https://arxiv.org/abs/2510.23929
Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
https://arxiv.org/abs/2509.09595
Recently, personalized portrait generation with text-to-image diffusion models has advanced significantly with Textual Inversion, which has emerged as a promising approach for creating high-fidelity personalized images. Despite its potential, current Textual Inversion methods struggle to maintain consistent facial identity due to semantic misalignments between the textual and visual embedding spaces with respect to identity. We introduce ID-EA, a novel framework that guides text embeddings to align with visual identity embeddings, thereby improving identity preservation in personalized generation. ID-EA comprises two key components: the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). First, the ID-Enhancer integrates identity embeddings with a textual ID anchor, refining visual identity embeddings derived from a face recognition model using representative text embeddings. Then, the ID-Adapter leverages the identity-enhanced embedding to adapt the text condition, ensuring identity preservation by adjusting the cross-attention module in the pre-trained UNet model. This process encourages the text features to find the most relevant visual clues across the foreground snippets. Extensive quantitative and qualitative evaluations demonstrate that ID-EA substantially outperforms state-of-the-art methods on identity preservation metrics while achieving remarkable computational efficiency, generating personalized portraits approximately 15 times faster than existing approaches.
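The ID-Adapter idea of adjusting cross-attention with an identity-enhanced embedding can be sketched as an extra key/value path, in the spirit of adapter-style conditioning; the module below is a hedged illustration, not the paper's implementation:

```python
# Hedged sketch of the general idea (not the paper's code): inject an
# identity-enhanced embedding into a UNet cross-attention layer by adding a
# second key/value path alongside the text conditioning.
import torch
import torch.nn as nn

class IDConditionedCrossAttention(nn.Module):
    def __init__(self, dim=320, ctx_dim=768, id_scale=0.6):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v_txt = nn.Linear(ctx_dim, dim, bias=False)
        self.to_k_id = nn.Linear(ctx_dim, dim, bias=False)   # identity path
        self.to_v_id = nn.Linear(ctx_dim, dim, bias=False)
        self.id_scale = id_scale

    def attend(self, q, k, v):
        # single-head scaled dot-product attention, kept simple for the sketch
        w = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return w @ v

    def forward(self, hidden, text_emb, id_emb):
        q = self.to_q(hidden)
        out_txt = self.attend(q, self.to_k_txt(text_emb), self.to_v_txt(text_emb))
        out_id = self.attend(q, self.to_k_id(id_emb), self.to_v_id(id_emb))
        return out_txt + self.id_scale * out_id   # blend text and identity cues

attn = IDConditionedCrossAttention()
h = torch.randn(2, 1024, 320)     # UNet spatial tokens
txt = torch.randn(2, 77, 768)     # text condition
ide = torch.randn(2, 4, 768)      # identity-enhanced tokens
out = attn(h, txt, ide)
```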
https://arxiv.org/abs/2507.11990
We consider the problem of disentangling 3D structure from large vision-language models, which we demonstrate on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle via canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. However, another form of entanglement arises from the significant noise in the LVLM's embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome it with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing the resources to label large datasets or train large models.
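The stochastic approximator for the Jacobian regularizer can be illustrated with a standard Hutchinson-style estimator: for v ~ N(0, I), the expectation of ||J^T v||^2 equals the squared Frobenius norm of the Jacobian, and each sample needs only one vector-Jacobian product. The sketch below is a generic version of this trick, not the paper's exact loss:

```python
# Generic stochastic Jacobian penalty (Hutchinson-style), illustrating the kind
# of regularizer the abstract mentions: E_v ||J^T v||^2 with v ~ N(0, I)
# approximates the squared Frobenius norm of the Jacobian of the generator
# output with respect to the (noisy) LVLM embedding.
import torch

def stochastic_jacobian_penalty(generator, embedding, num_samples=1):
    embedding = embedding.detach().requires_grad_(True)
    out = generator(embedding)
    penalty = 0.0
    for _ in range(num_samples):
        v = torch.randn_like(out)
        # vector-Jacobian product: much cheaper than forming the full Jacobian
        (vjp,) = torch.autograd.grad(out, embedding, grad_outputs=v,
                                     create_graph=True, retain_graph=True)
        penalty = penalty + vjp.pow(2).sum() / num_samples
    return penalty

# Toy usage with a linear "generator" standing in for the triplane decoder.
gen = torch.nn.Linear(512, 256)
emb = torch.randn(8, 512)          # CLIP-like embeddings
reg = stochastic_jacobian_penalty(gen, emb)
```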
https://arxiv.org/abs/2506.14015
This study presents a novel approach to improving the cost-to-quality ratio of image generation with diffusion models. We hypothesize that the differences between distilled (e.g., FLUX.1-schnell) and baseline (e.g., FLUX.1-dev) models are consistent and therefore learnable within a specialized domain, such as portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers photorealistic portraits comparable to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.
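A minimal sketch of the training recipe, assuming paired outputs from the distilled and baseline models for the same prompts and a small residual convolutional head (both the head and the loss are illustrative choices, not the paper's architecture):

```python
# Minimal sketch under stated assumptions: given paired "fast" (distilled) and
# "slow" (baseline) generations for the same prompts, train a lightweight
# image-to-image head with a simple reconstruction loss.
import torch
import torch.nn as nn

head = nn.Sequential(                      # fast enhancement layer
    nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

def train_step(fast_imgs, slow_imgs):
    # fast_imgs: outputs of the distilled model (e.g. FLUX.1-schnell)
    # slow_imgs: outputs of the baseline model (e.g. FLUX.1-dev), same prompts
    pred = fast_imgs + head(fast_imgs)     # residual refinement
    loss = nn.functional.l1_loss(pred, slow_imgs)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy batch of paired synthetic portraits.
loss = train_step(torch.rand(4, 3, 128, 128), torch.rand(4, 3, 128, 128))
```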
https://arxiv.org/abs/2505.02255
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed DICE-Talk, which follows the idea of disentangling identity from emotion and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on the MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
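One plausible reading of the Emotion Bank (codebook size, top-k selection, and similarity measure below are assumptions) is a learnable codebook queried by the identity-agnostic emotion feature and aggregated with attention weights:

```python
# Illustrative sketch only: a learnable "Emotion Bank" realized as a small
# codebook; an input emotion feature is matched to its most similar codes and
# the selected codes are aggregated with attention weights, mirroring the idea
# of vector quantization plus attention-based aggregation in broad strokes.
import torch
import torch.nn as nn

class EmotionBank(nn.Module):
    def __init__(self, num_codes=32, dim=256, topk=4):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))
        self.topk = topk

    def forward(self, emo_feat):                 # emo_feat: (B, dim)
        sim = emo_feat @ self.codes.t()          # (B, num_codes) similarities
        vals, idx = sim.topk(self.topk, dim=-1)  # most similar emotion codes
        picked = self.codes[idx]                 # (B, topk, dim)
        w = torch.softmax(vals, dim=-1).unsqueeze(-1)
        return (w * picked).sum(dim=1)           # correlation-aware emotion code

bank = EmotionBank()
audio_visual_emotion = torch.randn(2, 256)       # identity-agnostic emotion feature
cond = bank(audio_visual_emotion)                # (2, 256) conditioning vector
```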
https://arxiv.org/abs/2504.18087
We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.
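The multi-modal alignment objective might resemble symmetric InfoNCE terms that pull the joint representation toward the face, text, and image embedding spaces; the sketch below is a hedged illustration with an assumed temperature and equal loss weights:

```python
# Hedged sketch of a multi-modal alignment objective (details assumed): the
# joint face+text representation is pulled toward matching face, text, and
# image embeddings with symmetric InfoNCE-style terms.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def alignment_loss(joint_emb, face_emb, text_emb, image_emb):
    # joint_emb: unified identity+text representation from the encoder
    return (info_nce(joint_emb, face_emb) +
            info_nce(joint_emb, text_emb) +
            info_nce(joint_emb, image_emb))

loss = alignment_loss(torch.randn(8, 512), torch.randn(8, 512),
                      torch.randn(8, 512), torch.randn(8, 512))
```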
https://arxiv.org/abs/2504.14202
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: this https URL.
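A toy sketch of a motion intensity modulation module, assuming a FiLM-style scale/shift applied to motion tokens (the mapping network and value ranges are placeholders, not the paper's design):

```python
# Toy sketch (assumed design): map a user-chosen intensity scalar to scale and
# shift parameters applied to the motion latent, so expression/body dynamics
# can be dialed up or down independently of lip motion.
import torch
import torch.nn as nn

class MotionIntensityModulator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, 128), nn.SiLU(),
                                 nn.Linear(128, 2 * dim))

    def forward(self, motion_latent, intensity):
        # motion_latent: (B, T, dim); intensity: (B, 1) in [0, 1]
        scale, shift = self.mlp(intensity).chunk(2, dim=-1)
        return motion_latent * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

mod = MotionIntensityModulator()
latent = torch.randn(2, 16, 512)                    # 16 frames of motion tokens
out = mod(latent, torch.tensor([[0.3], [0.9]]))     # weak vs. strong motion
```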
https://arxiv.org/abs/2504.04842
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation, followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. The CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition-refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both the photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in the FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
https://arxiv.org/abs/2504.04010
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, applied as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training; no costly real-world light-stage data is needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite being trained only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
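Speculatively, the small estimation network could map intuitive shadow parameters to an offset in the conditioning space of the diffusion model; everything below (parameterization, dimensions, how the token is consumed) is an assumption for illustration only:

```python
# Speculative sketch, not the released method: a small estimator maps intuitive
# shadow parameters (placement, softness, intensity) to an offset in the
# diffusion model's conditioning space, learned from a few thousand synthetic
# portraits with known shadow settings.
import torch
import torch.nn as nn

class ShadowParamEncoder(nn.Module):
    def __init__(self, cond_dim=768):
        super().__init__()
        # inputs: [placement_x, placement_y, softness, intensity]
        self.mlp = nn.Sequential(nn.Linear(4, 256), nn.SiLU(),
                                 nn.Linear(256, cond_dim))

    def forward(self, shadow_params):
        return self.mlp(shadow_params)               # (B, cond_dim) shadow direction

enc = ShadowParamEncoder()
params = torch.tensor([[0.2, 0.7, 0.5, 0.8]])        # one shadow configuration
shadow_token = enc(params)                           # added/appended to the text conditioning
```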
https://arxiv.org/abs/2503.21943
Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning methods, such as LoRA and DreamBooth, can produce photorealistic outputs but require training on individual samples, which consumes time and resources and poses stability risks. Adapter-based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which cannot be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.
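The core hypernetwork idea, generating LoRA factors from an identity embedding and applying them without per-subject fine-tuning, can be sketched as follows; the rank, dimensions, and single-projection scope are chosen purely for illustration and are not the paper's configuration:

```python
# Minimal illustrative sketch (structure assumed): a plug-in hypernetwork maps
# an identity embedding to low-rank LoRA factors for one attention projection;
# at inference the generated LoRA is applied without per-subject fine-tuning.
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    def __init__(self, id_dim=512, in_features=320, out_features=320, rank=4):
        super().__init__()
        self.rank, self.inf, self.outf = rank, in_features, out_features
        self.to_A = nn.Linear(id_dim, rank * in_features)
        self.to_B = nn.Linear(id_dim, out_features * rank)

    def forward(self, id_emb):                       # id_emb: (id_dim,)
        A = self.to_A(id_emb).view(self.rank, self.inf)
        B = self.to_B(id_emb).view(self.outf, self.rank)
        return A, B                                  # delta_W = B @ A

def lora_forward(x, base_linear, A, B, scale=1.0):
    # base projection plus the generated low-rank update
    return base_linear(x) + scale * (x @ A.t() @ B.t())

hyper = LoRAHyperNet()
A, B = hyper(torch.randn(512))                       # e.g. from a face-ID encoder
base = nn.Linear(320, 320)
y = lora_forward(torch.randn(2, 77, 320), base, A, B)
```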
https://arxiv.org/abs/2503.16944
Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed-points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency. The codes are available at this https URL.
https://arxiv.org/abs/2503.12963
Recent Customized Portrait Generation (CPG) methods, which take a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and personalized portrait generation, we develop a multi-modal image customizer capable of generating controlled fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study to introduce facial adversarial attacks into CPG. Extensive experiments demonstrate the superiority of Adv-CPG; for example, its average attack success rate is 28.1% and 2.86% higher than that of SOTA noise-based attack methods and unconstrained attack methods, respectively.
https://arxiv.org/abs/2503.08269
Text-to-image generative models have shown remarkable progress in producing diverse and photorealistic outputs. In this paper, we present a comprehensive analysis of their effectiveness in creating synthetic portraits that accurately represent various demographic attributes, with a special focus on age, nationality, and gender. Our evaluation employs prompts specifying detailed profiles (e.g., Photorealistic selfie photo of a 32-year-old Canadian male), covering a broad spectrum of 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. We compare the generated images against ground truth age estimates from two established age estimation models to assess how faithfully age is depicted. Our findings reveal that although text-to-image models can consistently generate faces reflecting different identities, the accuracy with which they capture specific ages and do so across diverse demographic backgrounds remains highly variable. These results suggest that current synthetic data may be insufficiently reliable for high-stakes age-related tasks requiring robust precision, unless practitioners are prepared to invest in significant filtering and curation. Nevertheless, they may still be useful in less sensitive or exploratory applications, where absolute age precision is not critical.
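The evaluation protocol can be summarized as a loop over demographic profiles, with a hypothetical generator and age estimator standing in for the actual models (neither callable below is from the paper, and the demographic lists are truncated for illustration):

```python
# Sketch of the evaluation loop described above, with hypothetical stand-ins
# for the image generator and age estimator.
import itertools
import statistics

nationalities = ["Canadian", "Nigerian", "Japanese"]   # the paper covers 212
ages = list(range(10, 79, 17))                         # the paper uses 30 ages from 10 to 78
genders = ["male", "female"]

def build_prompt(age, nationality, gender):
    return f"Photorealistic selfie photo of a {age}-year-old {nationality} {gender}"

def evaluate(generate_image, estimate_age):
    """generate_image(prompt) -> image, estimate_age(image) -> float (years)."""
    errors = []
    for age, nat, gen in itertools.product(ages, nationalities, genders):
        img = generate_image(build_prompt(age, nat, gen))
        errors.append(abs(estimate_age(img) - age))
    return statistics.mean(errors)          # mean absolute age error

# Example with dummy callables standing in for the real models.
mae = evaluate(lambda prompt: prompt, lambda img: 40.0)
```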
https://arxiv.org/abs/2502.03420
Existing diffusion models show great potential for identity-preserving generation. However, personalized portrait generation remains challenging due to the diversity in user profiles, including variations in appearance and lighting conditions. To address these challenges, we propose IC-Portrait, a novel framework designed to accurately encode individual identities for personalized portrait generation. Our key insight is that pre-trained diffusion models are fast learners (e.g., 100 to 200 steps) for in-context dense correspondence matching, which motivates the two major designs of our IC-Portrait framework. Specifically, we reformulate portrait generation into two sub-tasks: 1) Lighting-Aware Stitching: we find that masking a high proportion of the input image, e.g., 80%, yields highly effective self-supervised representation learning of reference-image lighting. 2) View-Consistent Adaptation: we leverage a synthetic view-consistent profile dataset to learn the in-context correspondence. The reference profile can then be warped into arbitrary poses for strong spatially aligned view conditioning. Coupling these two designs by simply concatenating latents to form ControlNet-like supervision and modeling enables us to significantly enhance identity-preservation fidelity and stability. Extensive evaluations demonstrate that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively, with particularly notable improvements in visual quality. Furthermore, IC-Portrait even demonstrates 3D-aware relighting capabilities.
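The Lighting-Aware Stitching ingredient of masking roughly 80% of the input can be illustrated with simple random patch masking; the patch size, masking granularity, and downstream reconstruction objective are assumptions:

```python
# Hedged sketch of the masking idea only: randomly hide ~80% of image patches
# so the model must reconstruct them from the remaining context, which the
# abstract credits with learning the reference image's lighting.
import torch

def random_patch_mask(images, mask_ratio=0.8, patch=16):
    b, c, h, w = images.shape
    ph, pw = h // patch, w // patch
    keep = torch.rand(b, ph, pw) > mask_ratio           # True = keep this patch
    mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    return images * mask.unsqueeze(1), mask              # masked input + mask

imgs = torch.rand(2, 3, 256, 256)
masked, mask = random_patch_mask(imgs)                    # ~80% of patches zeroed
# a reconstruction loss on the hidden patches would provide the self-supervision
```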
https://arxiv.org/abs/2501.17159
This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution that facilitates simultaneous dual control of fine-grained expression and identity. Unlike conventional methods that use coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation for filtering out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer functions and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
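The layer-selective injection idea can be sketched as concatenating expression tokens to the cross-attention context only in a whitelist of expression-sensitive layers; the layer names and the injection rule below are hypothetical:

```python
# Illustrative sketch with assumed layer names: reference expression tokens are
# concatenated to the cross-attention context only in a whitelist of
# "expression-sensitive" layers, leaving other layers untouched to avoid
# style/identity leakage from the RGB expression template.
import torch

EXPRESSION_SENSITIVE = {"mid_block.attn", "up_blocks.1.attn"}   # hypothetical names

def build_context(layer_name, text_tokens, expr_tokens):
    # text_tokens: (B, Lt, D); expr_tokens: (B, Le, D) from the expression template
    if layer_name in EXPRESSION_SENSITIVE:
        return torch.cat([text_tokens, expr_tokens], dim=1)
    return text_tokens                       # other layers see text only

txt = torch.randn(1, 77, 768)
expr = torch.randn(1, 16, 768)
ctx_mid = build_context("mid_block.attn", txt, expr)        # (1, 93, 768)
ctx_down = build_context("down_blocks.0.attn", txt, expr)   # (1, 77, 768)
```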
https://arxiv.org/abs/2412.01254
Recent diffusion-based single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models, frequently yielding excessively blurred textures. We attribute this issue to insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy integrated within the optimization process, leveraging cross-view priors to enhance representation consistency. Extensive experiments demonstrate that our method can produce 3D portraits with accurate geometry and rich details from a single image. The project page is at this https URL.
https://arxiv.org/abs/2411.10369