Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed-keypoint limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using diffusion networks but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency. The code is available at this https URL.
https://arxiv.org/abs/2503.12963
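To make the keypoint-diffusion idea above concrete, here is a minimal PyTorch sketch of an audio-conditioned spatiotemporal denoiser over flattened 3D keypoint sequences, with one DDPM-style training step. All module names, dimensions, and the noise schedule are illustrative assumptions, not KDTalker's actual architecture.

```python
import torch
import torch.nn as nn

class KeypointDenoiser(nn.Module):
    """Toy spatiotemporal denoiser: predicts the noise added to a sequence
    of flattened 3D keypoints, conditioned on per-frame audio features and
    a diffusion timestep embedding."""
    def __init__(self, n_kp=68, audio_dim=128, hidden=256, layers=4):
        super().__init__()
        self.kp_proj = nn.Linear(n_kp * 3, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.t_embed = nn.Embedding(1000, hidden)
        enc = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc, num_layers=layers)
        self.out = nn.Linear(hidden, n_kp * 3)

    def forward(self, noisy_kp, audio, t):
        # noisy_kp: (B, T, n_kp*3); audio: (B, T, audio_dim); t: (B,)
        h = self.kp_proj(noisy_kp) + self.audio_proj(audio)
        h = h + self.t_embed(t)[:, None, :]  # broadcast the timestep over frames
        h = self.temporal(h)                 # attention across the time axis
        return self.out(h)

# One DDPM-style training step on random tensors (shapes only).
model = KeypointDenoiser()
x0 = torch.randn(2, 40, 68 * 3)              # clean keypoint sequence, 40 frames
audio = torch.randn(2, 40, 128)              # per-frame audio features
t = torch.randint(0, 1000, (2,))
alpha_bar = torch.cos(t.float() / 1000 * 1.57) ** 2  # toy cosine noise schedule
noise = torch.randn_like(x0)
xt = alpha_bar.sqrt()[:, None, None] * x0 + (1 - alpha_bar).sqrt()[:, None, None] * noise
loss = ((model(xt, audio, t) - noise) ** 2).mean()
loss.backward()
```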
Recent Customized Portrait Generation (CPG) methods, which take a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and personalized portrait generation, we develop a multi-modal image customizer capable of generating controlled fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study to introduce facial adversarial attacks into CPG. Extensive experiments demonstrate the superiority of Adv-CPG; e.g., its average attack success rate is 28.1% and 2.86% higher than those of SOTA noise-based attack methods and unconstrained attack methods, respectively.
https://arxiv.org/abs/2503.08269
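For reference, the attack success rate reported above is typically computed as the fraction of protected images that a face recognition model matches to the target identity. A minimal sketch, assuming L2-normalized embeddings and a generic cosine-similarity threshold (not the paper's exact protocol):

```python
import numpy as np

def attack_success_rate(protected_embs, target_emb, threshold=0.4):
    """Fraction of protected portraits that a face recognition model
    matches to the target identity (impersonation success). Embeddings
    are assumed L2-normalized; the threshold is a generic cosine
    operating point, not the paper's exact setting."""
    sims = protected_embs @ target_emb          # cosine similarity per image
    return float((sims > threshold).mean())

# Toy usage with random unit vectors standing in for real FR embeddings.
rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 512))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
target = rng.normal(size=512)
target /= np.linalg.norm(target)
print(attack_success_rate(embs, target))
```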
Text-to-image generative models have shown remarkable progress in producing diverse and photorealistic outputs. In this paper, we present a comprehensive analysis of their effectiveness in creating synthetic portraits that accurately represent various demographic attributes, with a special focus on age, nationality, and gender. Our evaluation employs prompts specifying detailed profiles (e.g., "Photorealistic selfie photo of a 32-year-old Canadian male"), covering a broad spectrum of 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. We compare the ages specified in the prompts against estimates from two established age estimation models applied to the generated images, to assess how faithfully age is depicted. Our findings reveal that although text-to-image models can consistently generate faces reflecting different identities, the accuracy with which they capture specific ages, and do so across diverse demographic backgrounds, remains highly variable. These results suggest that current synthetic data may be insufficiently reliable for high-stakes age-related tasks requiring robust precision, unless practitioners are prepared to invest in significant filtering and curation. Nevertheless, they may still be useful in less sensitive or exploratory applications, where absolute age precision is not critical.
https://arxiv.org/abs/2502.03420
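The faithfulness analysis described above boils down to comparing the age written into each prompt against an age estimator's prediction on the generated image. A minimal sketch of that bookkeeping, with illustrative field names:

```python
import statistics
from collections import defaultdict

def age_error_by_group(samples):
    """Mean absolute error between the prompted age and an age
    estimator's prediction, broken down by a demographic key.
    Field names ('age', 'age_pred', 'group') are illustrative."""
    errors = defaultdict(list)
    for s in samples:
        errors[s["group"]].append(abs(s["age"] - s["age_pred"]))
    return {g: statistics.mean(v) for g, v in errors.items()}

print(age_error_by_group([
    {"age": 32, "age_pred": 29.5, "group": "Canadian"},
    {"age": 32, "age_pred": 41.0, "group": "Canadian"},
    {"age": 70, "age_pred": 55.0, "group": "Japanese"},
]))
```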
Existing diffusion models show great potential for identity-preserving generation. However, personalized portrait generation remains challenging due to the diversity in user profiles, including variations in appearance and lighting conditions. To address these challenges, we propose IC-Portrait, a novel framework designed to accurately encode individual identities for personalized portrait generation. Our key insight is that pre-trained diffusion models are fast learners (e.g., 100-200 steps) for in-context dense correspondence matching, which motivates the two major designs of our IC-Portrait framework. Specifically, we reformulate portrait generation into two sub-tasks: 1) Lighting-Aware Stitching: we find that masking a high proportion of the input image, e.g., 80%, yields highly effective self-supervised representation learning of reference-image lighting. 2) View-Consistent Adaptation: we leverage a synthetic view-consistent profile dataset to learn the in-context correspondence. The reference profile can then be warped into arbitrary poses for strong spatially aligned view conditioning. Coupling these two designs, by simply concatenating latents to form ControlNet-like supervision and modeling, enables us to significantly enhance identity preservation fidelity and stability. Extensive evaluations demonstrate that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively, with particularly notable improvements in visual quality. Furthermore, IC-Portrait even demonstrates 3D-aware relighting capabilities.
https://arxiv.org/abs/2501.17159
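As a rough illustration of the high-ratio masking used in Lighting-Aware Stitching, the sketch below zeroes out roughly 80% of non-overlapping patches; the paper's actual masking scheme may differ.

```python
import torch

def random_patch_mask(img, patch=16, keep_ratio=0.2):
    """Zero out a random ~80% of non-overlapping patches, keeping ~20%.
    A minimal stand-in for the high-ratio masking objective described
    above."""
    b, c, h, w = img.shape
    gh, gw = h // patch, w // patch
    keep = torch.rand(b, 1, gh, gw) < keep_ratio          # per-patch keep mask
    mask = keep.float().repeat_interleave(patch, dim=2)   # upsample to pixel grid
    mask = mask.repeat_interleave(patch, dim=3)
    return img * mask, mask

imgs = torch.randn(4, 3, 256, 256)
masked, mask = random_patch_mask(imgs)
print(mask.mean().item())  # ~0.2 of pixels kept
```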
This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation for filtering out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.
https://arxiv.org/abs/2412.01254
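The layer-selective injection described above can be pictured as wrapping each network block and adding projected reference-expression features only in blocks whitelisted as expression-sensitive. A schematic sketch (block structure and injection point are assumptions, not EmojiDiff's actual wiring):

```python
import torch
import torch.nn as nn

class ExpressionInjector(nn.Module):
    """Adds projected reference-expression features to a block's hidden
    states, but only for blocks marked as expression-sensitive; other
    blocks pass through untouched to avoid style leakage."""
    def __init__(self, block, inject, feat_dim, hidden_dim):
        super().__init__()
        self.block, self.inject = block, inject
        self.proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, x, expr_feat):
        h = self.block(x)
        if self.inject:                  # skip style-leaky layers
            h = h + self.proj(expr_feat)
        return h

# Four toy blocks; only blocks 2 and 3 receive the expression signal.
blocks = nn.ModuleList([
    ExpressionInjector(nn.Linear(64, 64), inject=(i in {2, 3}),
                       feat_dim=32, hidden_dim=64)
    for i in range(4)
])
x, expr = torch.randn(1, 64), torch.randn(1, 32)
for blk in blocks:
    x = blk(x, expr)
print(x.shape)
```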
Recent diffusion-based single-image 3D portrait generation methods typically employ 2D diffusion models to provide multi-view knowledge, which is then distilled into 3D representations. However, these methods usually struggle to produce high-fidelity 3D models, frequently yielding excessively blurred textures. We attribute this issue to the insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. From the conditioning standpoint, we propose a Hybrid Priors Diffusion model, which explicitly and implicitly incorporates multi-view priors as conditions to enhance the status consistency of the generated multi-view portraits. From the diffusion perspective, considering the significant impact of the diffusion noise distribution on detailed texture generation, we propose a Multi-View Noise Resampling Strategy, integrated within the optimization process, that leverages cross-view priors to enhance representation consistency. Extensive experiments demonstrate that our method can produce 3D portraits with accurate geometry and rich details from a single image. The project page is at this https URL.
https://arxiv.org/abs/2411.10369
We propose MegaPortrait, an innovative system for creating personalized portrait images in computer vision. It comprises three modules: Identity Net, Shading Net, and Harmonization Net. Identity Net generates a learned identity using a customized model fine-tuned on source images. Shading Net re-renders the portrait using extracted representations. Harmonization Net fuses the pasted face with the reference image's body for coherent results. Combined with off-the-shelf ControlNets, our approach outperforms state-of-the-art AI portrait products in identity preservation and image fidelity. MegaPortrait has a simple but effective design, and we compare it with other methods and products to demonstrate its superiority.
https://arxiv.org/abs/2411.04357
In the field of human-centric personalized image generation, adapter-based methods obtain the ability to customize and generate portraits through text-to-image training on facial data. This allows for identity-preserved personalization without additional fine-tuning at inference. Although there are improvements in efficiency and fidelity, there is often a significant decrease in text-following ability, controllability, and diversity of generated faces compared to the base model. In this paper, we analyze that this performance degradation is attributable to the failure to decouple identity features from other attributes during extraction, as well as the failure to decouple the portrait generation training from the overall generation task. To address these issues, we propose the Face Adapter with deCoupled Training (FACT) framework, focusing on both model architecture and training strategy. To decouple identity features from others, we leverage a transformer-based face-export encoder and harness fine-grained identity features. To decouple the portrait generation training, we propose Face Adapting Increment Regularization (FAIR), which effectively constrains the effect of face adapters to the facial region, preserving the generative ability of the base model. Additionally, we incorporate a face condition drop and shuffle mechanism, combined with curriculum learning, to enhance facial controllability and diversity. As a result, FACT solely learns identity preservation from the training data, thereby minimizing the impact on the original text-to-image capabilities of the base model. Extensive experiments show that FACT achieves both controllability and fidelity in both text-to-image generation and inpainting solutions for portrait generation.
https://arxiv.org/abs/2410.12312
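One way to read the "constrain the adapter to the facial region" idea behind FAIR is as a penalty on adapter-induced changes outside a face mask. A rough sketch under that assumption (the paper's actual regularizer may be formulated differently):

```python
import torch

def face_region_regularizer(base_pred, adapter_pred, face_mask):
    """Penalize adapter-induced changes outside the facial region so the
    base model's behavior is preserved elsewhere.
    face_mask: (B, 1, H, W) with 1 inside the face."""
    delta = adapter_pred - base_pred
    return ((1.0 - face_mask) * delta).pow(2).mean()

base = torch.randn(2, 4, 64, 64)            # e.g. predictions without the adapter
adapted = base + 0.1 * torch.randn_like(base)
mask = torch.zeros(2, 1, 64, 64)
mask[:, :, 16:48, 16:48] = 1.0              # toy face box
print(face_region_regularizer(base, adapted, mask))
```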
Portrait fidelity generation is a prominent research area in generative models, with a primary focus on enhancing both controllability and fidelity. Current methods face challenges in generating high-fidelity portrait results when faces occupy a small portion of the image at low resolution, especially in multi-person group photo settings. To tackle these issues, we propose a systematic solution called MagicID, based on a self-constructed million-scale multi-modal dataset named IDZoom. MagicID consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM-Inversion-based ID Restoration inference framework (DIIR). During training, MMF iteratively uses the skeleton and landmark modalities from IDZoom as conditional guidance. By introducing Clone Face Tuning in the training stage and Mask Guided Multi-ID Cross Attention (MGMICA) in the inference stage, explicit constraints on face positional features are achieved for multi-ID group photo generation. DIIR aims to address the issue of artifacts: DDIM Inversion is used in conjunction with face landmarks and global and local face features to achieve face restoration while keeping the background unchanged. Additionally, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method. To validate the effectiveness of MagicID, we conducted extensive comparative and ablation experiments. The experimental results demonstrate that MagicID has significant advantages in both subjective and objective metrics, and achieves controllable generation in multi-person scenarios.
https://arxiv.org/abs/2408.09248
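DDIM Inversion, which DIIR builds on, deterministically maps a clean latent back to noise by running the DDIM update in reverse. A minimal unconditional sketch (MagicID's landmark and face-feature conditioning is omitted, and `eps_model` is a stand-in denoiser):

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_bar):
    """Run the deterministic DDIM update in reverse, stepping a clean
    latent toward noise. eps_model(x, t) is assumed to predict the noise;
    alphas_bar is a decreasing cumulative-alpha schedule (t=0 ~ clean)."""
    x = x0
    for t in range(len(alphas_bar) - 1):
        a_t, a_next = alphas_bar[t], alphas_bar[t + 1]
        eps = eps_model(x, torch.tensor(t))
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # step toward noise
    return x

# Toy denoiser and schedule, shapes only.
alphas_bar = torch.linspace(0.999, 0.01, 50)
noisy = ddim_invert(torch.randn(1, 4, 8, 8),
                    lambda x, t: torch.randn_like(x), alphas_bar)
print(noisy.shape)
```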
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves state-of-the-art performance in consistent character generation, with an improvement of 24.8% compared with other methods.
https://arxiv.org/abs/2406.16537
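The dynamic region-level adapters can be thought of as blending per-character adapter outputs under segmentation masks so each reference only influences its own region. A schematic sketch with illustrative shapes and a simple additive mixing rule (not Character-Adapter's actual attention mechanism):

```python
import torch
import torch.nn as nn

def region_adapter_mix(hidden, region_masks, adapters, ref_feats):
    """Blend per-region adapter outputs under segmentation masks so each
    reference character only influences its own region.
    hidden: (B, C, H, W); each mask: (B, 1, H, W); each ref feat: (B, D)."""
    out = hidden.clone()
    for mask, adapter, feat in zip(region_masks, adapters, ref_feats):
        injected = adapter(feat)[:, :, None, None]  # (B, C, 1, 1)
        out = out + mask * injected                 # confined to this region
    return out

hidden = torch.randn(1, 32, 16, 16)
masks = [torch.zeros(1, 1, 16, 16), torch.zeros(1, 1, 16, 16)]
masks[0][..., :, :8] = 1.0   # left half: character A
masks[1][..., :, 8:] = 1.0   # right half: character B
adapters = [nn.Linear(64, 32) for _ in range(2)]
feats = [torch.randn(1, 64) for _ in range(2)]
print(region_adapter_mix(hidden, masks, adapters, feats).shape)
```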
Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driven mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.
https://arxiv.org/abs/2406.10553
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive ID-preservation strategy that fully considers intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions, and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through a facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.
https://arxiv.org/abs/2404.16771
Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry priors and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a neural rendering-based framework with a novel joint geometry-appearance prior that achieves text-to-3D-portrait generation while overcoming the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360° canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic colors and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.
https://arxiv.org/abs/2404.10394
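Score distillation sampling, used above to distill the diffusion model's knowledge into the pyramid tri-grid, adds noise to a rendering, queries the frozen denoiser, and backpropagates the residual through the rendering. A minimal unweighted sketch with a stand-in denoiser (the usual timestep weighting w(t) is omitted):

```python
import torch

def sds_loss(rendered, diffusion_eps, alphas_bar, t):
    """Unweighted score distillation sampling: noise the rendering, get
    the frozen model's noise prediction, and use a surrogate whose
    gradient w.r.t. the rendering is (eps_pred - eps).
    diffusion_eps(x, t) stands in for a text-conditioned denoiser."""
    a = alphas_bar[t]
    eps = torch.randn_like(rendered)
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * eps
    with torch.no_grad():
        eps_pred = diffusion_eps(noisy, t)
    # Surrogate loss: its gradient through `rendered` is (eps_pred - eps).
    return ((eps_pred - eps).detach() * rendered).sum()

render = torch.randn(1, 4, 8, 8, requires_grad=True)  # stand-in for a rendering
alphas_bar = torch.linspace(0.999, 0.01, 1000)
loss = sds_loss(render, lambda x, t: torch.randn_like(x), alphas_bar, t=500)
loss.backward()
print(render.grad.abs().mean())
```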
Artistic video portrait generation is a significant and sought-after task in the fields of computer graphics and vision. While various methods have been developed that integrate NeRFs or StyleGANs with instructional editing models for creating and editing drivable portraits, these approaches face several challenges: they often rely heavily on large datasets, require extensive customization processes, and frequently result in reduced image quality. To address these problems, we propose the Efficient Monotonic Video Style Avatar (Emo-Avatar), which uses deferred neural rendering to enhance StyleGAN's capacity for producing dynamic, drivable portrait videos. We propose a two-stage deferred neural rendering pipeline. In the first stage, we initialize the StyleGAN generator with few-shot PTI, using several extreme poses sampled from the video, to capture a consistent representation of the aligned face of the target portrait. In the second stage, we propose a Laplacian pyramid for high-frequency texture sampling from UV maps deformed by the dynamic flow of expression, integrating a motion-aware texture prior that provides torso features and enhances StyleGAN's ability to render the complete upper body for portrait videos. Emo-Avatar reduces style customization time from hours to merely 5 minutes compared with existing methods. In addition, Emo-Avatar requires only a single reference image for editing and employs region-aware contrastive learning with semantic-invariant CLIP guidance, ensuring consistent high-resolution output and identity preservation. Through both quantitative and qualitative assessments, Emo-Avatar demonstrates superior performance over existing methods in terms of training efficiency, rendering quality, and editability in self- and cross-reenactment.
https://arxiv.org/abs/2402.00827
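The Laplacian pyramid used above for high-frequency texture sampling is the classic band-pass decomposition; a minimal sketch, with bilinear resampling standing in for Gaussian filtering:

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=3):
    """Classic Laplacian pyramid: each level keeps the high-frequency
    detail lost by downsampling; the final entry is the low-frequency
    base. Summing the upsampled levels reconstructs the input."""
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, 2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)   # band-pass residual at this scale
        current = down
    pyramid.append(current)            # low-frequency base
    return pyramid

tex = torch.randn(1, 3, 128, 128)      # stand-in for a UV texture map
for level in laplacian_pyramid(tex):
    print(level.shape)
```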
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video. The existing methods fail to simultaneously achieve the goals of accurate 3D avatar reconstruction and stable talking face animation. Besides, while the existing works mainly focus on synthesizing the head part, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Portrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and switchable background using a head-torso-background super-resolution model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos compared to previous methods.
https://arxiv.org/abs/2401.08503
Generating realistic talking faces is an interesting and long-standing topic in the field of computer vision. Although significant progress has been made, it is still challenging to generate high-quality dynamic faces with personalized details. This is mainly due to the inability of a general model to represent personalized details and the generalization problem for unseen controllable parameters. In this work, we propose MyPortrait, a simple, general, and flexible framework for neural portrait generation. We incorporate a personalized prior from a monocular video and a morphable prior from 3D face morphable space for generating personalized details under novel controllable parameters. Our proposed framework supports both video-driven and audio-driven face animation given a monocular video of a single person. Depending on whether the test data is used during training, our method provides a real-time online version and a high-quality offline version. Comprehensive experiments on various metrics demonstrate the superior performance of our method over state-of-the-art methods. The code will be publicly available.
https://arxiv.org/abs/2312.02703
Despite rapid advances in computer graphics, creating high-quality photo-realistic virtual portraits is prohibitively expensive. Furthermore, the well-known "uncanny valley" effect in rendered portraits has a significant impact on the user experience, especially when the depiction closely resembles a human likeness, where any minor artifacts can evoke feelings of eeriness and repulsiveness. In this paper, we present a novel photo-realistic portrait generation framework that can effectively mitigate the "uncanny valley" effect and improve the overall authenticity of rendered portraits. Our key idea is to employ transfer learning to learn an identity-consistent mapping from the latent space of rendered portraits to that of real portraits. During the inference stage, the input portrait of an avatar can be directly transferred to a realistic portrait by changing its appearance style while maintaining the facial identity. To this end, we collect a new dataset, Daz-Rendered-Faces-HQ (DRFHQ), that is specifically designed for rendering-style portraits. We leverage this dataset to fine-tune the StyleGAN2 generator using our carefully crafted framework, which helps to preserve the geometric and color features relevant to facial identity. We evaluate our framework using portraits with diverse gender, age, and race variations. Qualitative and quantitative evaluations and ablation studies show the advantages of our method compared to state-of-the-art approaches.
https://arxiv.org/abs/2310.04194
Recent years have witnessed great progress in creating vivid audio-driven portraits from monocular videos. However, how to seamlessly adapt the created video avatars to other scenarios with different backgrounds and lighting conditions remains unsolved. On the other hand, existing relighting studies mostly rely on dynamically lighted or multi-view data, which are too expensive for creating video portraits. To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images. Specifically, we incorporate 3D facial priors derived from audio features to predict delicate normal maps through implicit functions. These initially predicted normals then play a crucial part in reflectance decomposition by dynamically estimating the lighting condition of the given video. Moreover, the stereoscopic face representation is refined using an identity-consistent loss under simulated multiple lighting conditions, addressing the ill-posed problem caused by the limited views available from a single monocular video. Extensive experiments validate the superiority of our proposed framework on both real and synthetic datasets. Our code is released at this https URL.
https://arxiv.org/abs/2309.02434
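Reflectance decomposition as described above rests on a standard image-formation model: pixel color ≈ albedo × shading(normals, lighting). A minimal Lambertian sketch (ReliTalk's actual lighting model may be richer, e.g. spherical harmonics):

```python
import torch

def lambertian_render(albedo, normals, light_dir, ambient=0.2):
    """Re-shade a portrait from decomposed components: a diffuse
    (Lambertian) term from per-pixel normals plus an ambient floor.
    albedo: (B, 3, H, W); normals: (B, 3, H, W) unit vectors;
    light_dir: (3,) directional light."""
    light = light_dir / light_dir.norm()
    ndotl = (normals * light[None, :, None, None]).sum(dim=1, keepdim=True)
    shading = ambient + (1 - ambient) * ndotl.clamp(min=0.0)
    return albedo * shading

albedo = torch.rand(1, 3, 64, 64)
normals = torch.nn.functional.normalize(torch.randn(1, 3, 64, 64), dim=1)
img = lambertian_render(albedo, normals, torch.tensor([0.0, 0.0, 1.0]))
print(img.shape)
```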
Previous animatable 3D-aware GANs for human generation have primarily focused on either the human head or full body. However, head-only videos are relatively uncommon in real life, and full body generation typically does not deal with facial expression control and still has challenges in generating high-quality results. Towards applicable video avatars, we present an animatable 3D-aware GAN that generates portrait images with controllable facial expression, head pose, and shoulder movements. It is a generative model trained on unstructured 2D image collections without using 3D or video data. For the new task, we base our method on the generative radiance manifold representation and equip it with learnable facial and head-shoulder deformations. A dual-camera rendering and adversarial learning scheme is proposed to improve the quality of the generated faces, which is critical for portrait images. A pose deformation processing network is developed to generate plausible deformations for challenging regions such as long hair. Experiments show that our method, trained on unstructured 2D images, can generate diverse and high-quality 3D portraits with desired control over different properties.
https://arxiv.org/abs/2309.02186
Recent advancements in personalized image generation have unveiled the intriguing capability of pre-trained text-to-image models to learn identity information from a collection of portrait images. However, existing solutions can be unreliable in producing truthful details and usually suffer from several defects, such as (i) the generated face exhibits its own unique characteristics, i.e., facial shape and facial feature positioning may not resemble key characteristics of the input, and (ii) the synthesized face may contain warped, blurred, or corrupted regions. In this paper, we present FaceChain, a personalized portrait generation framework that combines a series of customized image-generation models and a rich set of face-related perceptual understanding models (e.g., face detection, deep face embedding extraction, and facial attribute recognition) to tackle the aforementioned challenges and to generate truthful personalized portraits, with only a handful of portrait images as input. Concretely, we inject several SOTA face models into the generation procedure, achieving more efficient label tagging, data processing, and model post-processing compared to previous solutions such as DreamBooth, InstantBooth, or other LoRA-only approaches. Through the development of FaceChain, we have identified several potential directions to accelerate the development of Face/Human-Centric AIGC research and application. We have designed FaceChain as a framework comprised of pluggable components that can be easily adjusted to accommodate different styles and personalized needs. We hope it can grow to serve the burgeoning needs of the communities. FaceChain is open-sourced under the Apache-2.0 license at this https URL.
https://arxiv.org/abs/2308.14256