Portrait Fidelity Generation is a prominent research area in generative models, with a primary focus on enhancing both controllability and fidelity. Current methods struggle to generate high-fidelity portraits when faces occupy a small, low-resolution portion of the image, especially in multi-person group photo settings. To tackle these issues, we propose a systematic solution called MagicID, based on a self-constructed, million-scale multi-modal dataset named IDZoom. MagicID consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM-Inversion-based ID Restoration inference framework (DIIR). During training, MMF iteratively uses the skeleton and landmark modalities from IDZoom as conditional guidance. By introducing Clone Face Tuning in the training stage and Mask Guided Multi-ID Cross Attention (MGMICA) in the inference stage, explicit constraints on face position features are achieved for multi-ID group photo generation. DIIR aims to address artifacts: DDIM Inversion is used in conjunction with face landmarks and global and local face features to restore the face while keeping the background unchanged. Additionally, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method. To validate the effectiveness of MagicID, we conducted extensive comparative and ablation experiments. The results demonstrate that MagicID has significant advantages in both subjective and objective metrics, and achieves controllable generation in multi-person scenarios.
https://arxiv.org/abs/2408.09248
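The abstract above does not spell out DIIR's internals, but plain DDIM inversion, the building block it is named after, is standard. Below is a minimal sketch of that inversion step, assuming an epsilon-prediction network `eps_model(x, t)` and a 1-D tensor `alphas_cumprod` holding the usual cumulative alpha-bar schedule; both names are illustrative, not the authors' API.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    """Deterministically map a clean latent x0 back to noise by running the
    DDIM update in reverse (eta = 0). `eps_model(x, t)` predicts the added
    noise; `alphas_cumprod` is indexed by integer timestep."""
    T = len(alphas_cumprod)
    timesteps = torch.linspace(0, T - 1, num_steps).long()
    x = x0
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        # Recover the model's current estimate of the clean sample ...
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # ... then step "forward" in noise level instead of backward.
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # approximately the noise that regenerates x0 under DDIM sampling
```

Per the abstract, DIIR would then re-run DDIM sampling from this inverted latent with landmark and ID-feature conditioning, constraining non-face regions so the background stays unchanged.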
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have struggled to preserve characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves state-of-the-art performance in consistent character generation, with an improvement of 24.8% over other methods.
https://arxiv.org/abs/2406.16537
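A hedged sketch of the region-level idea described above: each reference character's adapter features are blended only into its own segmented region, which is one way to limit concept confusion. Tensor shapes and names are assumptions for illustration, not Character-Adapter's actual interface.

```python
def region_level_fusion(hidden_states, adapter_outputs, region_masks):
    """Blend per-character adapter outputs into the base hidden states so that
    each reference character only influences its own segmented region.

    hidden_states:   (B, H*W, C) base cross-attention output
    adapter_outputs: list of (B, H*W, C) tensors, one per reference character
    region_masks:    list of (B, H*W, 1) soft masks from prompt-guided segmentation
    """
    fused = hidden_states.clone()
    for out, mask in zip(adapter_outputs, region_masks):
        # Inside a character's region, mix in that character's adapter features;
        # outside it, leave the base features untouched (limits concept confusion).
        fused = mask * out + (1 - mask) * fused
    return fused
```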
Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driven mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.
https://arxiv.org/abs/2406.10553
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive ID-preservation strategy that fully considers both intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions, and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through a facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.
https://arxiv.org/abs/2404.16771
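The facial attention localization strategy is described only at a high level; one plausible form, sketched under the assumption that cross-attention maps of facial-part tokens should concentrate inside the matching facial regions, is shown below. Names and shapes are illustrative, not the paper's implementation.

```python
def attention_localization_loss(attn_maps, region_masks, eps=1e-8):
    """Encourage the cross-attention of each facial-part token to stay inside
    the corresponding facial region (eyes, nose, mouth, ...).

    attn_maps:    (B, K, H, W) attention of K facial-part tokens over image locations
    region_masks: (B, K, H, W) binary masks of the matching facial regions
    """
    attn = attn_maps / (attn_maps.sum(dim=(-2, -1), keepdim=True) + eps)
    # Mass of attention that leaks outside the target region, averaged over tokens.
    leak = (attn * (1 - region_masks)).sum(dim=(-2, -1))
    return leak.mean()
```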
Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry priors and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a neural rendering-based framework with a novel joint geometry-appearance prior that achieves text-to-3D-portrait generation while overcoming the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360° canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic color and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.
https://arxiv.org/abs/2404.10394
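Score distillation sampling (SDS) itself is standard, so a compact sketch of one SDS step (in the spirit of DreamFusion) may help readers unfamiliar with it; the `eps_model(x_t, t, text_emb)` signature, timestep range, and weighting below are generic assumptions rather than Portrait3D's exact choices.

```python
import torch

def sds_grad(latents, eps_model, text_emb, alphas_cumprod, guidance_scale=25.0):
    """One score-distillation step on rendered/encoded latents (B, C, H, W):
    noise them to a random timestep and push them toward the diffusion prior
    for the given text embedding. Returns the gradient w.r.t. the latents,
    to be chained through the renderer by the caller (no backprop through the UNet)."""
    B = latents.shape[0]
    t = torch.randint(20, 980, (B,), device=latents.device)
    a_t = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(latents)
    x_t = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise
    with torch.no_grad():
        eps_text = eps_model(x_t, t, text_emb)
        eps_uncond = eps_model(x_t, t, None)
        # Classifier-free guidance on the predicted noise.
        eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
    w = 1 - a_t                    # a common weighting choice
    return w * (eps - noise)
```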
Artistic video portrait generation is a significant and sought-after task in the fields of computer graphics and vision. While various methods have been developed that integrate NeRFs or StyleGANs with instructional editing models for creating and editing drivable portraits, these approaches face several challenges: they often rely heavily on large datasets, require extensive customization processes, and frequently result in reduced image quality. To address these problems, we propose the Efficient Monotonic Video Style Avatar (Emo-Avatar), which uses deferred neural rendering to enhance StyleGAN's capacity for producing dynamic, drivable portrait videos. We propose a two-stage deferred neural rendering pipeline. In the first stage, we use few-shot PTI initialization on several extreme poses sampled from the video so that the StyleGAN generator captures a consistent representation of the aligned face of the target portrait. In the second stage, we propose a Laplacian pyramid for sampling high-frequency texture from UV maps deformed by the dynamic flow of expression, integrating a motion-aware texture prior that supplies torso features and enhances StyleGAN's ability to render the complete upper body in portrait videos. Emo-Avatar reduces style customization time from hours to merely 5 minutes compared with existing methods. In addition, Emo-Avatar requires only a single reference image for editing and employs region-aware contrastive learning with semantic-invariant CLIP guidance, ensuring consistent high-resolution output and identity preservation. Through both quantitative and qualitative assessments, Emo-Avatar demonstrates superior performance over existing methods in terms of training efficiency, rendering quality, and editability in self- and cross-reenactment.
https://arxiv.org/abs/2402.00827
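For reference, a generic Laplacian-pyramid decomposition like the one used above for high-frequency texture sampling can be written in a few lines; this version uses average pooling for downsampling, which may differ from the paper's filtering.

```python
import torch.nn.functional as F

def laplacian_pyramid(img, levels=4):
    """Split an image (B, C, H, W) into band-pass levels; the finest levels carry
    the high-frequency texture detail sampled from the deformed UV maps."""
    pyramid, current = [], img
    for _ in range(levels - 1):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)   # high-frequency residual at this scale
        current = down
    pyramid.append(current)            # low-frequency base
    return pyramid
```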
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image and then animate it with a reference video or audio to generate a talking portrait video. Existing methods fail to simultaneously achieve the goals of accurate 3D avatar reconstruction and stable talking face animation. Besides, while existing works mainly focus on synthesizing the head, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Portrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and a switchable background using a head-torso-background super-resolution model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos compared to previous methods.
https://arxiv.org/abs/2401.08503
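The four components listed in the abstract compose into a simple pipeline; the sketch below only mirrors that composition at a schematic level, and every callable (`i2p_model`, `a2m_model`, `motion_adapter`, `sr_model`) is a stand-in, not the released code's API.

```python
def real3d_portrait_pipeline(src_img, audio, i2p_model, a2m_model,
                             motion_adapter, sr_model):
    """High-level flow implied by the abstract (all callables are stand-ins):
    one-shot 3D reconstruction -> audio-to-motion -> motion-conditioned
    animation -> head-torso-background super-resolution."""
    tri_plane = i2p_model(src_img)                         # image-to-plane reconstruction
    motion_seq = a2m_model(audio)                          # generalizable audio-to-motion
    raw_frames = [motion_adapter(tri_plane, m) for m in motion_seq]
    return [sr_model(frame, src_img) for frame in raw_frames]
```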
Generating realistic talking faces is an interesting and long-standing topic in the field of computer vision. Although significant progress has been made, it is still challenging to generate high-quality dynamic faces with personalized details. This is mainly due to the inability of a general model to represent personalized details and the difficulty of generalizing to unseen controllable parameters. In this work, we propose Myportrait, a simple, general, and flexible framework for neural portrait generation. We incorporate a personalized prior from a monocular video and a morphable prior from 3D face morphable space for generating personalized details under novel controllable parameters. Our proposed framework supports both video-driven and audio-driven face animation given a monocular video of a single person. Depending on whether the test data is included in training, our method provides a real-time online version and a high-quality offline version. Comprehensive experiments across various metrics demonstrate the superior performance of our method over state-of-the-art methods. The code will be publicly available.
https://arxiv.org/abs/2312.02703
Despite rapid advances in computer graphics, creating high-quality photo-realistic virtual portraits is prohibitively expensive. Furthermore, the well-known "uncanny valley" effect in rendered portraits has a significant impact on the user experience, especially when the depiction closely resembles a human likeness, where any minor artifacts can evoke feelings of eeriness and repulsiveness. In this paper, we present a novel photo-realistic portrait generation framework that can effectively mitigate the "uncanny valley" effect and improve the overall authenticity of rendered portraits. Our key idea is to employ transfer learning to learn an identity-consistent mapping from the latent space of rendered portraits to that of real portraits. During the inference stage, the input portrait of an avatar can be directly transferred to a realistic portrait by changing its appearance style while maintaining the facial identity. To this end, we collect a new dataset, Daz-Rendered-Faces-HQ (DRFHQ), that is specifically designed for rendering-style portraits. We leverage this dataset to fine-tune the StyleGAN2 generator, using our carefully crafted framework, which helps to preserve the geometric and color features relevant to facial identity. We evaluate our framework using portraits with diverse gender, age, and race variations. Qualitative and quantitative evaluations and ablation studies show the advantages of our method compared to state-of-the-art approaches.
https://arxiv.org/abs/2310.04194
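The identity-consistent mapping above implies some identity-preservation term during StyleGAN2 fine-tuning; a common, generic formulation based on deep face embeddings is sketched below as an assumption, with `face_embedder` standing in for any pretrained face-recognition network.

```python
import torch.nn.functional as F

def identity_consistency_loss(face_embedder, rendered_img, transferred_img):
    """Keep facial identity fixed while the appearance style changes:
    maximize cosine similarity between deep face embeddings of the input
    rendered portrait and its realistic counterpart."""
    e_src = F.normalize(face_embedder(rendered_img), dim=-1)
    e_out = F.normalize(face_embedder(transferred_img), dim=-1)
    return 1 - (e_src * e_out).sum(dim=-1).mean()
```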
Recent years have witnessed great progress in creating vivid audio-driven portraits from monocular videos. However, how to seamlessly adapt the created video avatars to other scenarios with different backgrounds and lighting conditions remains unsolved. On the other hand, existing relighting studies mostly rely on dynamically lit or multi-view data, which are too expensive for creating video portraits. To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images. Specifically, we incorporate 3D facial priors derived from audio features to predict delicate normal maps through implicit functions. These initially predicted normals then play a crucial part in reflectance decomposition by dynamically estimating the lighting condition of the given video. Moreover, the stereoscopic face representation is refined using an identity-consistent loss under simulated multiple lighting conditions, addressing the ill-posed problem caused by the limited views available from a single monocular video. Extensive experiments validate the superiority of our proposed framework on both real and synthetic datasets. Our code is released at this https URL.
https://arxiv.org/abs/2309.02434
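As a worked example of the reflectance decomposition the abstract relies on, the simplest Lambertian re-composition (albedo times shading from predicted normals and an estimated light) looks like this; ReliTalk's actual lighting model is likely richer, so treat this strictly as a simplification.

```python
def lambertian_render(albedo, normals, light_dir, ambient=0.1):
    """Re-compose an image from decomposed reflectance: albedo * shading,
    with shading from a single directional light.

    albedo:    (B, 3, H, W)
    normals:   (B, 3, H, W), unit length
    light_dir: (B, 3), unit direction toward the light
    """
    shading = (normals * light_dir.view(-1, 3, 1, 1)).sum(dim=1, keepdim=True)
    shading = shading.clamp(min=0) + ambient   # clamp back-facing, add ambient term
    return albedo * shading
```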
Previous animatable 3D-aware GANs for human generation have primarily focused on either the human head or full body. However, head-only videos are relatively uncommon in real life, and full body generation typically does not deal with facial expression control and still has challenges in generating high-quality results. Towards applicable video avatars, we present an animatable 3D-aware GAN that generates portrait images with controllable facial expression, head pose, and shoulder movements. It is a generative model trained on unstructured 2D image collections without using 3D or video data. For the new task, we base our method on the generative radiance manifold representation and equip it with learnable facial and head-shoulder deformations. A dual-camera rendering and adversarial learning scheme is proposed to improve the quality of the generated faces, which is critical for portrait images. A pose deformation processing network is developed to generate plausible deformations for challenging regions such as long hair. Experiments show that our method, trained on unstructured 2D images, can generate diverse and high-quality 3D portraits with desired control over different properties.
https://arxiv.org/abs/2309.02186
Recent advancements in personalized image generation have unveiled the intriguing capability of pre-trained text-to-image models to learn identity information from a collection of portrait images. However, existing solutions can be vulnerable in producing truthful details and usually suffer from several defects, such as (i) the generated face exhibits its own unique characteristics, i.e., facial shape and facial feature positioning may not resemble key characteristics of the input, and (ii) the synthesized face may contain warped, blurred or corrupted regions. In this paper, we present FaceChain, a personalized portrait generation framework that combines a series of customized image-generation models and a rich set of face-related perceptual understanding models (e.g., face detection, deep face embedding extraction, and facial attribute recognition) to tackle the aforementioned challenges and to generate truthful personalized portraits, with only a handful of portrait images as input. Concretely, we inject several SOTA face models into the generation procedure, achieving more efficient label-tagging, data-processing, and model post-processing compared to previous solutions such as DreamBooth [Ruiz et al., 2023], InstantBooth [Shi et al., 2023], or other LoRA-only approaches [Hu et al., 2021]. Through the development of FaceChain, we have identified several potential directions to accelerate the development of Face/Human-Centric AIGC research and application. We have designed FaceChain as a framework comprised of pluggable components that can be easily adjusted to accommodate different styles and personalized needs. We hope it can grow to serve the burgeoning needs of the communities. FaceChain is open-sourced under the Apache-2.0 license at this https URL.
https://arxiv.org/abs/2308.14256
Audio-driven portrait animation aims to synthesize portrait videos that are conditioned on given audio. Animating high-fidelity and multimodal video portraits has a variety of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training different models or sampling signals from given videos. However, the lack of correlation learning between lip-sync and other movements (e.g., head pose/eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages: 1) a Mapping-Once network with Dual Attentions (MODA) generates a talking representation from given audio; in MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities; 2) a facial composer network generates dense and detailed face landmarks; and 3) a temporal-guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits compared to previous methods.
https://arxiv.org/abs/2307.10008
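The dual-attention module is only named in the abstract; purely as a schematic guess, a two-branch attention block with separate heads for lip motion and for the remaining movements could look like the following. Dimensions, output parameterizations, and structure are all assumptions, not MODA's implementation.

```python
import torch.nn as nn

class DualAttention(nn.Module):
    """Schematic dual-branch attention: one branch cross-attends audio features
    for lip motion, the other models the remaining movements (head pose, blinks)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.lip_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.other_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lip_head = nn.Linear(dim, 40 * 3)   # e.g. 40 mouth landmarks (x, y, z)
        self.other_head = nn.Linear(dim, 6 + 2)  # e.g. head pose + eye blink values

    def forward(self, queries, audio_feat):
        # queries: (B, T, dim) learned motion queries; audio_feat: (B, T', dim)
        lip, _ = self.lip_attn(queries, audio_feat, audio_feat)
        other, _ = self.other_attn(queries, audio_feat, audio_feat)
        return self.lip_head(lip), self.other_head(other)
```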
Text-to-3D is an emerging task that allows users to create 3D content with infinite possibilities. Existing works tackle the problem by optimizing a 3D representation with guidance from pre-trained diffusion models. An apparent drawback is that they need to optimize from scratch for each prompt, which is computationally expensive and often yields poor visual fidelity. In this paper, we propose DreamPortrait, which aims to generate text-guided 3D-aware portraits in a single forward pass for efficiency. To achieve this, we extend Score Distillation Sampling from a datapoint to a distribution formulation, which injects a semantic prior into a 3D distribution. However, the direct extension leads to the mode collapse problem since the objective only pursues semantic alignment. Hence, we propose to optimize a distribution with hierarchical condition adapters and GAN loss regularization. For better 3D modeling, we further design a 3D-aware gated cross-attention mechanism to explicitly let the model perceive the correspondence between the text and the 3D-aware space. These elaborated designs enable our model to generate portraits with robust multi-view semantic consistency, eliminating the need for optimization-based methods. Extensive experiments demonstrate our model's highly competitive performance and significant speed boost over existing methods.
https://arxiv.org/abs/2306.02083
Nowadays, the wide application of virtual digital humans promotes the comprehensive prosperity and development of digital culture supported by the digital economy. Personalized portraits automatically generated by AI technology need both a natural artistic style and human sentiment. In this paper, we propose a novel StyleIdentityGAN model, which can ensure both the identity and the artistry of the generated portrait. Specifically, the style-enhanced module focuses on decoupling and transferring artistic style features to improve the artistry of generated virtual face images. Meanwhile, the identity-enhanced module preserves the significant features extracted from the input photo. Furthermore, the proposed method requires only a small amount of reference style data. Experiments demonstrate the superiority of StyleIdentityGAN over state-of-the-art methods in artistry and identity effects, with comparisons done qualitatively, quantitatively, and through a perceptual user study. Code has been released on GitHub.
https://arxiv.org/abs/2303.00377
While recent research has progressively overcome the low-resolution constraint of one-shot face video re-enactment with the help of StyleGAN's high-fidelity portrait generation, these approaches rely on at least one of the following: explicit 2D/3D priors, optical-flow-based warping as motion descriptors, off-the-shelf encoders, etc., which constrain their performance (e.g., inconsistent predictions, inability to capture fine facial details and accessories, poor generalization, artifacts). We propose an end-to-end framework that simultaneously supports face attribute edits, facial motions and deformations, and facial identity control for video generation. It employs a hybrid latent space that encodes a given frame into a pair of latents: an identity latent, $\mathcal{W}_{ID}$, and a facial deformation latent, $\mathcal{S}_F$, which respectively reside in the $W+$ and $SS$ spaces of StyleGAN2. It thereby incorporates the impressive editability-distortion trade-off of $W+$ and the high disentanglement properties of $SS$. These hybrid latents drive the StyleGAN2 generator to achieve high-fidelity face video re-enactment at $1024^2$. Furthermore, the model supports the generation of realistic re-enactment videos with other latent-based semantic edits (e.g., beard, age, make-up, etc.). Qualitative and quantitative analyses performed against state-of-the-art methods demonstrate the superiority of the proposed approach.
https://arxiv.org/abs/2302.07848
Creating photo-realistic versions of people's sketched portraits is useful for various entertainment purposes. Existing studies only generate portraits in the 2D plane with fixed views, making the results less vivid. In this paper, we present Stereoscopic Simplified Sketch-to-Portrait (SSSP), which explores the possibility of creating stereoscopic, 3D-aware portraits from simple contour sketches by leveraging 3D generative models. Our key insight is to design sketch-aware constraints that can fully exploit the prior knowledge of a tri-plane-based 3D-aware generative model. Specifically, our designed region-aware volume rendering strategy and global consistency constraint further enhance detail correspondences during sketch encoding. Moreover, in order to facilitate use by layman users, we propose a Contour-to-Sketch module with vector quantized representations, so that easily drawn contours can directly guide the generation of 3D portraits. Extensive comparisons show that our method generates high-quality results that match the sketch. Our usability study verifies that our system is greatly preferred by users.
https://arxiv.org/abs/2302.06857
In contrast to the traditional avatar creation pipeline, which is a costly process, contemporary generative approaches directly learn the data distribution from photographs, and the state of the art can now yield highly photo-realistic images. While plenty of works attempt to extend the unconditional generative models and achieve some level of controllability, it is still challenging to ensure multi-view consistency, especially in large poses. In this work, we propose a 3D portrait generation network that produces 3D-consistent portraits while being controllable according to semantic parameters regarding pose, identity, expression and lighting. The generative network uses neural scene representation to model portraits in 3D, whose generation is guided by a parametric face model that supports explicit control. While the latent disentanglement can be further enhanced by contrasting images with partially different attributes, there still exists noticeable inconsistency in non-face areas, e.g., hair and background, when animating expressions. We solve this by proposing a volume blending strategy in which we form a composite output by blending the dynamic and static radiance fields, with the two parts segmented from the jointly learned semantic field. Our method outperforms prior arts in extensive experiments, producing realistic portraits with vivid expression in natural lighting when viewed from free viewpoints. The proposed method also demonstrates generalization ability to real images as well as out-of-domain cartoon faces, showing great promise in real applications. Additional video results and code will be available on the project webpage.
https://arxiv.org/abs/2209.05434
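One plausible reading of the volume blending strategy above is a per-sample convex combination of the dynamic and static radiance fields, weighted by the jointly learned semantic field; the sketch below shows that blending step only (standard volume rendering would follow), with shapes and names chosen for illustration.

```python
def blend_radiance_fields(sigma_dyn, rgb_dyn, sigma_sta, rgb_sta, face_prob):
    """Composite a dynamic (animated face) and a static (hair/background)
    radiance field before volume rendering. `face_prob` is a per-sample
    probability from the semantic field.

    Shapes: densities (N_rays, N_samples, 1), colors (N_rays, N_samples, 3),
    face_prob (N_rays, N_samples, 1).
    """
    sigma = face_prob * sigma_dyn + (1 - face_prob) * sigma_sta
    rgb = face_prob * rgb_dyn + (1 - face_prob) * rgb_sta
    return sigma, rgb
```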
Over the years, 2D GANs have achieved great successes in photorealistic portrait generation. However, they lack 3D understanding in the generation process, thus they suffer from multi-view inconsistency problem. To alleviate the issue, many 3D-aware GANs have been proposed and shown notable results, but 3D GANs struggle with editing semantic attributes. The controllability and interpretability of 3D GANs have not been much explored. In this work, we propose two solutions to overcome these weaknesses of 2D GANs and 3D-aware GANs. We first introduce a novel 3D-aware GAN, SURF-GAN, which is capable of discovering semantic attributes during training and controlling them in an unsupervised manner. After that, we inject the prior of SURF-GAN into StyleGAN to obtain a high-fidelity 3D-controllable generator. Unlike existing latent-based methods allowing implicit pose control, the proposed 3D-controllable StyleGAN enables explicit pose control over portrait generation. This distillation allows direct compatibility between 3D control and many StyleGAN-based techniques (e.g., inversion and stylization), and also brings an advantage in terms of computational resources. Our codes are available at this https URL.
https://arxiv.org/abs/2207.10257
This article presents an evolutionary approach for synthetic human portraits generation based on the latent space exploration of a generative adversarial network. The idea is to produce different human face images very similar to a given target portrait. The approach applies StyleGAN2 for portrait generation and FaceNet for face similarity evaluation. The evolutionary search is based on exploring the real-coded latent space of StyleGAN2. The main results over both synthetic and real images indicate that the proposed approach generates accurate and diverse solutions, which represent realistic human portraits. The proposed research can contribute to improving the security of face recognition systems.
https://arxiv.org/abs/2204.11887
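The evolutionary search over StyleGAN2's real-coded latent space, with FaceNet similarity as fitness, maps naturally onto a small (mu + lambda)-style loop; the sketch below assumes a user-supplied `fitness_fn` that wraps the generator and the similarity model, and all hyperparameters are illustrative.

```python
import numpy as np

def evolve_latents(fitness_fn, dim=512, pop_size=32, generations=200,
                   sigma=0.3, elite_frac=0.25, seed=0):
    """Evolutionary search over a real-coded latent space. `fitness_fn(z)`
    should return the FaceNet similarity between the portrait generated from z
    and the target portrait (higher is better); the StyleGAN2 generator and
    FaceNet live inside that callable, outside this sketch."""
    rng = np.random.default_rng(seed)
    pop = rng.standard_normal((pop_size, dim))
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        scores = np.array([fitness_fn(z) for z in pop])
        elites = pop[np.argsort(scores)[-n_elite:]]          # keep the best latents
        # Offspring: Gaussian mutations of randomly chosen elites.
        parents = elites[rng.integers(0, n_elite, pop_size - n_elite)]
        children = parents + sigma * rng.standard_normal((pop_size - n_elite, dim))
        pop = np.concatenate([elites, children], axis=0)
    scores = np.array([fitness_fn(z) for z in pop])
    return pop[int(np.argmax(scores))]
```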