Portrait_Generation

Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior

2024-04-16 08:52:42

Yiqian Wu, Hao Xu, Xiangjun Tang, Xien Chen, Siyu Tang, Zhebin Zhang, Chen Li, Xiaogang Jin

arXiv_CV

arXiv_CV GAN Image_Generation Portrait_Generation Knowledge 3D Diffusion
Abstract

Existing neural rendering-based text-to-3D-portrait generation methods typically make use of human geometry prior and diffusion models to obtain guidance. However, relying solely on geometry information introduces issues such as the Janus problem, over-saturation, and over-smoothing. We present Portrait3D, a novel neural rendering-based framework with a novel joint geometry-appearance prior to achieve text-to-3D-portrait generation that overcomes the aforementioned issues. To accomplish this, we train a 3D portrait generator, 3DPortraitGAN-Pyramid, as a robust prior. This generator is capable of producing 360° canonical 3D portraits, serving as a starting point for the subsequent diffusion-based generation process. To mitigate the "grid-like" artifact caused by the high-frequency information in the feature-map-based 3D representation commonly used by most 3D-aware GANs, we integrate a novel pyramid tri-grid 3D representation into 3DPortraitGAN-Pyramid. To generate 3D portraits from text, we first project a randomly generated image aligned with the given prompt into the pre-trained 3DPortraitGAN-Pyramid's latent space. The resulting latent code is then used to synthesize a pyramid tri-grid. Beginning with the obtained pyramid tri-grid, we use score distillation sampling to distill the diffusion model's knowledge into the pyramid tri-grid. Following that, we utilize the diffusion model to refine the rendered images of the 3D portrait and then use these refined images as training data to further optimize the pyramid tri-grid, effectively eliminating issues with unrealistic color and unnatural artifacts. Our experimental results show that Portrait3D can produce realistic, high-quality, and canonical 3D portraits that align with the prompt.
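The score distillation sampling step mentioned above can be illustrated with a toy sketch. This is not the Portrait3D implementation: it assumes an identity "renderer" (the parameters are the image), a hypothetical noise predictor that is exact for a data distribution concentrated at a single point `mu`, and the weighting w(t) = 1 - alpha_bar. All names (`sds_gradient`, `run_sds`, `mu`) are illustrative.

```python
import numpy as np

def sds_gradient(x, mu, alpha_bar, rng):
    """One stochastic SDS gradient for an identity renderer (image == parameters)."""
    eps = rng.standard_normal(x.shape)                    # noise injected at timestep t
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    # Toy noise predictor (assumption): exact when all data mass sits at mu.
    eps_hat = (x_t - np.sqrt(alpha_bar) * mu) / np.sqrt(1.0 - alpha_bar)
    weight = 1.0 - alpha_bar                              # w(t), one common choice
    return weight * (eps_hat - eps)                       # d(render)/d(theta) is the identity

def run_sds(theta0, mu, steps=200, lr=0.5, seed=0):
    """Gradient descent on theta using SDS gradients at random noise levels."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(steps):
        alpha_bar = rng.uniform(0.02, 0.98)               # random noise level
        theta -= lr * sds_gradient(theta, mu, alpha_bar, rng)
    return theta

mu = np.array([1.0, -2.0, 0.5])       # stand-in for "what the diffusion prior prefers"
theta = run_sds(np.zeros(3), mu)
print(np.abs(theta - mu).max())       # distance to the prior's mode after optimization
```

In a real pipeline the noise predictor would be a text-conditioned diffusion model and the gradient would be backpropagated through a differentiable renderer into the 3D representation; the toy only shows the shape of the update.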

URL

https://arxiv.org/abs/2404.10394

PDF

https://arxiv.org/pdf/2404.10394.pdf
Emo-Avatar: Efficient Monocular Video Style Avatar through Texture Rendering

2024-02-01 18:14:42

Pinxin Liu, Luchuan Song, Daoan Zhang, Hang Hua, Yunlong Tang, Huaijin Tu, Jiebo Luo, Chenliang Xu

arXiv_CV

arXiv_CV GAN Image_Generation Portrait_Generation Face Quantitative Pose Few-Shot Contrastive_Learning
Abstract

Artistic video portrait generation is a significant and sought-after task in the fields of computer graphics and vision. While various methods have been developed that integrate NeRFs or StyleGANs with instructional editing models for creating and editing drivable portraits, these approaches face several challenges. They often rely heavily on large datasets, require extensive customization processes, and frequently result in reduced image quality. To address these problems, we propose the Efficient Monocular Video Style Avatar (Emo-Avatar), which uses deferred neural rendering to enhance StyleGAN's capacity for producing dynamic, drivable portrait videos. We propose a two-stage deferred neural rendering pipeline. In the first stage, we use few-shot PTI initialization to initialize the StyleGAN generator from several extreme poses sampled from the video, capturing a consistent representation of aligned faces from the target portrait. In the second stage, we propose a Laplacian pyramid for high-frequency texture sampling from UV maps deformed by the dynamic flow of expression. This motion-aware texture prior provides torso features that enhance StyleGAN's ability to generate the complete upper body for portrait video rendering. Emo-Avatar reduces style customization time from hours to merely 5 minutes compared with existing methods. In addition, Emo-Avatar requires only a single reference image for editing and employs region-aware contrastive learning with semantic invariant CLIP guidance, ensuring consistent high-resolution output and identity preservation. Through both quantitative and qualitative assessments, Emo-Avatar demonstrates superior performance over existing methods in terms of training efficiency, rendering quality, and editability in self- and cross-reenactment.
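To make the Laplacian pyramid idea concrete, here is a minimal sketch of splitting a texture into high-frequency bands plus a low-resolution residual. It is a generic illustration, not the Emo-Avatar code: 2x2 averaging and nearest-neighbour upsampling stand in for the usual blur-and-resample filters, and the function names are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging each 2x2 block (stand-in for blur + decimate)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(img, levels):
    """Return [band 0, ..., band levels-1, low-res residual].

    Height and width must be divisible by 2**levels.
    """
    pyramid, current = [], img
    for _ in range(levels):
        low = downsample(current)
        pyramid.append(current - upsample(low))   # detail lost by downsampling
        current = low
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition by upsampling and adding the bands back."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = upsample(current) + band
    return current

rng = np.random.default_rng(0)
texture = rng.random((16, 16))                    # stand-in for a UV texture map
pyr = build_laplacian_pyramid(texture, levels=3)
restored = reconstruct(pyr)
print([level.shape for level in pyr])
```

The first entries of `pyr` carry the fine detail at each scale, which is the kind of high-frequency signal a texture-sampling stage could draw on.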

URL

https://arxiv.org/abs/2402.00827

PDF

https://arxiv.org/pdf/2402.00827.pdf
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

2024-01-16 17:04:30

Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao

arXiv_CV

arXiv_CV Image_Generation Portrait_Generation Super_Resolution Face Knowledge 3D Reconstruction
Abstract

One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video. The existing methods fail to simultaneously achieve the goals of accurate 3D avatar reconstruction and stable talking face animation. Besides, while the existing works mainly focus on synthesizing the head part, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Portrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and switchable background using a head-torso-background super-resolution model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos compared to previous methods.

URL

https://arxiv.org/abs/2401.08503

PDF

https://arxiv.org/pdf/2401.08503.pdf
MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

2023-12-05 12:05:01

Bo Ding, Zhenfeng Fan, Shuang Yang, Shihong Xia

arXiv_CV

arXiv_CV Image_Generation Portrait_Generation Face Pose 3D
Abstract

Generating realistic talking faces is an interesting and long-standing topic in the field of computer vision. Although significant progress has been made, it is still challenging to generate high-quality dynamic faces with personalized details. This is mainly due to the inability of the general model to represent personalized details and its poor generalization to unseen controllable parameters. In this work, we propose MyPortrait, a simple, general, and flexible framework for neural portrait generation. We incorporate a personalized prior from a monocular video and a morphable prior in 3D face morphable space to generate personalized details under novel controllable parameters. Our proposed framework supports both video-driven and audio-driven face animation given a monocular video of a single person. Depending on whether the test data is used in training, our method provides a real-time online version and a high-quality offline version. Comprehensive experiments on various metrics demonstrate the superior performance of our method over state-of-the-art methods. The code will be publicly available.

URL

https://arxiv.org/abs/2312.02703

PDF

https://arxiv.org/pdf/2312.02703.pdf
Enhancing the Authenticity of Rendered Portraits with Identity-Consistent Transfer Learning

2023-10-06 12:20:40

Luyuan Wang, Yiqian Wu, Yongliang Yang, Chen Liu, Xiaogang Jin

arXiv_CV

arXiv_CV GAN Image_Generation Portrait_Generation Face Transfer_Learning Inference Quantitative
Abstract

Despite rapid advances in computer graphics, creating high-quality photo-realistic virtual portraits is prohibitively expensive. Furthermore, the well-known "uncanny valley" effect in rendered portraits has a significant impact on the user experience, especially when the depiction closely resembles a human likeness, where any minor artifacts can evoke feelings of eeriness and repulsiveness. In this paper, we present a novel photo-realistic portrait generation framework that can effectively mitigate the "uncanny valley" effect and improve the overall authenticity of rendered portraits. Our key idea is to employ transfer learning to learn an identity-consistent mapping from the latent space of rendered portraits to that of real portraits. During the inference stage, the input portrait of an avatar can be directly transferred to a realistic portrait by changing its appearance style while maintaining the facial identity. To this end, we collect a new dataset, Daz-Rendered-Faces-HQ (DRFHQ), that is specifically designed for rendering-style portraits. We leverage this dataset to fine-tune the StyleGAN2 generator, using our carefully crafted framework, which helps to preserve the geometric and color features relevant to facial identity. We evaluate our framework using portraits with diverse gender, age, and race variations. Qualitative and quantitative evaluations and ablation studies show the advantages of our method compared to state-of-the-art approaches.

URL

https://arxiv.org/abs/2310.04194

PDF

https://arxiv.org/pdf/2310.04194.pdf
ReliTalk: Relightable Talking Portrait Generation from a Single Video

2023-09-05 17:59:42

Haonan Qiu, Zhaoxi Chen, Yuming Jiang, Hang Zhou, Xiangyu Fan, Lei Yang, Wayne Wu, Ziwei Liu

arXiv_CV

arXiv_CV Image_Generation Portrait_Generation Face Pose 3D
Abstract

Recent years have witnessed great progress in creating vivid audio-driven portraits from monocular videos. However, how to seamlessly adapt the created video avatars to other scenarios with different backgrounds and lighting conditions remains unsolved. On the other hand, existing relighting studies mostly rely on dynamically lighted or multi-view data, which are too expensive for creating video portraits. To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images. Specifically, we use 3D facial priors derived from audio features to predict delicate normal maps through implicit functions. These initially predicted normals then play a crucial part in reflectance decomposition by dynamically estimating the lighting condition of the given video. Moreover, the stereoscopic face representation is refined using the identity-consistent loss under simulated multiple lighting conditions, addressing the ill-posed problem caused by limited views available from a single monocular video. Extensive experiments validate the superiority of our proposed framework on both real and synthetic datasets. Our code has been released.

URL

https://arxiv.org/abs/2309.02434

PDF

https://arxiv.org/pdf/2309.02434.pdf
AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections

2023-09-05 12:44:57

Yue Wu, Sicheng Xu, Jianfeng Xiang, Fangyun Wei, Qifeng Chen, Jiaolong Yang, Xin Tong

arXiv_AI

arXiv_AI GAN Image_Generation Portrait_Generation Adversarial Face Pose 3D
Abstract

Previous animatable 3D-aware GANs for human generation have primarily focused on either the human head or full body. However, head-only videos are relatively uncommon in real life, and full body generation typically does not deal with facial expression control and still has challenges in generating high-quality results. Towards applicable video avatars, we present an animatable 3D-aware GAN that generates portrait images with controllable facial expression, head pose, and shoulder movements. It is a generative model trained on unstructured 2D image collections without using 3D or video data. For the new task, we base our method on the generative radiance manifold representation and equip it with learnable facial and head-shoulder deformations. A dual-camera rendering and adversarial learning scheme is proposed to improve the quality of the generated faces, which is critical for portrait images. A pose deformation processing network is developed to generate plausible deformations for challenging regions such as long hair. Experiments show that our method, trained on unstructured 2D images, can generate diverse and high-quality 3D portraits with desired control over different properties.

URL

https://arxiv.org/abs/2309.02186

PDF

https://arxiv.org/pdf/2309.02186.pdf
FaceChain: A Playground for Identity-Preserving Portrait Generation

2023-08-28 02:20:44

Yang Liu, Cheng Yu, Lei Shang, Ziheng Wu, Xingjun Wang, Yuze Zhao, Lin Zhu, Chen Cheng, Weitao Chen, Chao Xu, Haoyu Xie, Yuan Yao, Wenmeng Zhou, Yingda Chen, Xuansong Xie, Baigui Sun

arXiv_AI

arXiv_AI Recognition Image_Generation Detection Portrait_Generation Face Face_Detection Embedding Action 3D
Abstract

Recent advancements in personalized image generation have unveiled the intriguing capability of pre-trained text-to-image models to learn identity information from a collection of portrait images. However, existing solutions can struggle to produce truthful details and usually suffer from several defects, such as (i) the generated face exhibiting its own unique characteristics, i.e., facial shape and facial feature positioning may not resemble key characteristics of the input, and (ii) the synthesized face containing warped, blurred, or corrupted regions. In this paper, we present FaceChain, a personalized portrait generation framework that combines a series of customized image-generation models and a rich set of face-related perceptual understanding models (e.g., face detection, deep face embedding extraction, and facial attribute recognition) to tackle the aforementioned challenges and generate truthful personalized portraits with only a handful of portrait images as input. Concretely, we inject several SOTA face models into the generation procedure, achieving more efficient label-tagging, data-processing, and model post-processing than previous solutions such as DreamBooth (Ruiz et al., 2023), InstantBooth (Shi et al., 2023), or other LoRA-only approaches (Hu et al., 2021). Through the development of FaceChain, we have identified several potential directions to accelerate the development of Face/Human-Centric AIGC research and applications. We have designed FaceChain as a framework composed of pluggable components that can be easily adjusted to accommodate different styles and personalized needs. We hope it can grow to serve the burgeoning needs of the community. FaceChain is open-sourced under the Apache-2.0 license.

URL

https://arxiv.org/abs/2308.14256

PDF

https://arxiv.org/pdf/2308.14256.pdf
MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

2023-07-19 14:45:11

Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, Yu Li

arXiv_CV

arXiv_CV Image_Generation Portrait_Generation Face Attention Relation Pose
Abstract

Audio-driven portrait animation aims to synthesize portrait videos that are conditioned by given audio. Animating high-fidelity and multimodal video portraits has a variety of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training different models or sampling signals from given videos. However, the lack of correlation learning between lip-sync and other movements (e.g., head pose and eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages: 1) the Mapping-Once network with Dual Attentions (MODA) generates a talking representation from the given audio, using a dual-attention module that we design to encode accurate mouth movements and diverse modalities; 2) a facial composer network generates dense and detailed face landmarks; and 3) a temporal-guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits compared to previous methods.

URL

https://arxiv.org/abs/2307.10008

PDF

https://arxiv.org/pdf/2307.10008.pdf
Efficient Text-Guided 3D-Aware Portrait Generation with Score Distillation Sampling on Distribution

2023-06-03 11:08:38

Yiji Cheng, Fei Yin, Xiaoke Huang, Xintong Yu, Jiaxiang Liu, Shikun Feng, Yujiu Yang, Yansong Tang

arXiv_CV

arXiv_CV GAN Image_Generation Regularization Portrait_Generation Attention Optimization Pose 3D Diffusion
Abstract

Text-to-3D is an emerging task that allows users to create 3D content with infinite possibilities. Existing works tackle the problem by optimizing a 3D representation with guidance from pre-trained diffusion models. An apparent drawback is that they need to optimize from scratch for each prompt, which is computationally expensive and often yields poor visual fidelity. In this paper, we propose DreamPortrait, which aims to generate text-guided 3D-aware portraits in a single-forward pass for efficiency. To achieve this, we extend Score Distillation Sampling from datapoint to distribution formulation, which injects semantic prior into a 3D distribution. However, the direct extension will lead to the mode collapse problem since the objective only pursues semantic alignment. Hence, we propose to optimize a distribution with hierarchical condition adapters and GAN loss regularization. For better 3D modeling, we further design a 3D-aware gated cross-attention mechanism to explicitly let the model perceive the correspondence between the text and the 3D-aware space. These elaborated designs enable our model to generate portraits with robust multi-view semantic consistency, eliminating the need for optimization-based methods. Extensive experiments demonstrate our model's highly competitive performance and significant speed boost against existing methods.

URL

https://arxiv.org/abs/2306.02083

PDF

https://arxiv.org/pdf/2306.02083.pdf
Few-shots Portrait Generation with Style Enhancement and Identity Preservation

2023-03-01 10:02:12

Runchuan Zhu, Naye Ji, Youbing Zhao, Fan Zhang

arXiv_CV

arXiv_CV GAN Image_Generation Portrait_Generation Face Sentiment Quantitative Pose Few-Shot Enhancement
Abstract

Nowadays, the wide application of virtual digital humans promotes the comprehensive prosperity and development of digital culture supported by the digital economy. A personalized portrait automatically generated by AI technology needs both a natural artistic style and human sentiment. In this paper, we propose a novel StyleIdentityGAN model, which can ensure the identity and artistry of the generated portrait at the same time. Specifically, the style-enhanced module focuses on decoupling and transferring artistic style features to improve the artistry of generated virtual face images. Meanwhile, the identity-enhanced module preserves the significant features extracted from the input photo. Furthermore, the proposed method requires only a small amount of reference style data. Experiments demonstrate the superiority of StyleIdentityGAN over state-of-the-art methods in artistry and identity effects, with comparisons done qualitatively, quantitatively, and through a perceptual user study. Code has been released on GitHub.

URL

https://arxiv.org/abs/2303.00377

PDF

https://arxiv.org/pdf/2303.00377.pdf
One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2

2023-02-15 18:34:15

Trevine Oorloff, Yaser Yacoob

arXiv_CV

arXiv_CV GAN Image_Generation Portrait_Generation Face Prediction Quantitative Pose 3D Optical_Flow
Abstract

While recent research has progressively overcome the low-resolution constraint of one-shot face video re-enactment with the help of StyleGAN's high-fidelity portrait generation, these approaches rely on at least one of the following: explicit 2D/3D priors, optical flow based warping as motion descriptors, off-the-shelf encoders, etc., which constrain their performance (e.g., inconsistent predictions, inability to capture fine facial details and accessories, poor generalization, artifacts). We propose an end-to-end framework for simultaneously supporting face attribute edits, facial motions and deformations, and facial identity control for video generation. It employs a hybrid latent space that encodes a given frame into a pair of latents: an identity latent, $\mathcal{W}_{ID}$, and a facial deformation latent, $\mathcal{S}_F$, which reside in the $W+$ and $SS$ spaces of StyleGAN2, respectively. This design combines the impressive editability-distortion trade-off of $W+$ with the high disentanglement properties of $SS$. These hybrid latents employ the StyleGAN2 generator to achieve high-fidelity face video re-enactment at $1024^2$. Furthermore, the model supports the generation of realistic re-enactment videos with other latent-based semantic edits (e.g., beard, age, make-up, etc.). Qualitative and quantitative analyses performed against state-of-the-art methods demonstrate the superiority of the proposed approach.

URL

https://arxiv.org/abs/2302.07848

PDF

https://arxiv.org/pdf/2302.07848.pdf
Make Your Brief Stroke Real and Stereoscopic: 3D-Aware Simplified Sketch to Portrait Generation

2023-02-14 06:28:42

Yasheng Sun, Qianyi Wu, Hang Zhou, Kaisiyuan Wang, Tianshu Hu, Chen-Chieh Liao, Dongliang He, Jingtuo Liu, Errui Ding, Jingdong Wang, Shio Miyafuji, Ziwei Liu, Hideki Koike

arXiv_CV

arXiv_CV Image_Generation Portrait_Generation Knowledge Pose 3D Sketch Contour
Abstract

Creating photo-realistic versions of people's sketched portraits is useful for various entertainment purposes. Existing studies only generate portraits in the 2D plane with fixed views, making the results less vivid. In this paper, we present Stereoscopic Simplified Sketch-to-Portrait (SSSP), which explores the possibility of creating stereoscopic 3D-aware portraits from simple contour sketches by involving 3D generative models. Our key insight is to design sketch-aware constraints that can fully exploit the prior knowledge of a tri-plane-based 3D-aware generative model. Specifically, our region-aware volume rendering strategy and global consistency constraint further enhance detail correspondences during sketch encoding. Moreover, to make the system easier for lay users, we propose a Contour-to-Sketch module with vector quantized representations, so that easily drawn contours can directly guide the generation of 3D portraits. Extensive comparisons show that our method generates high-quality results that match the sketch. Our usability study verifies that our system is greatly preferred by users.

URL

https://arxiv.org/abs/2302.06857

PDF

https://arxiv.org/pdf/2302.06857.pdf
Explicitly Controllable 3D-Aware Portrait Generation

2022-09-12 17:40:08

Junshu Tang, Bo Zhang, Binxin Yang, Ting Zhang, Dong Chen, Lizhuang Ma, Fang Wen

arXiv_CV

arXiv_CV Image_Generation Portrait_Generation Face Pose 3D
Abstract

In contrast to the traditional avatar creation pipeline, which is a costly process, contemporary generative approaches directly learn the data distribution from photographs, and the state of the art can now yield highly photo-realistic images. While plenty of works attempt to extend unconditional generative models and achieve some level of controllability, it is still challenging to ensure multi-view consistency, especially in large poses. In this work, we propose a 3D portrait generation network that produces 3D consistent portraits while being controllable according to semantic parameters regarding pose, identity, expression, and lighting. The generative network uses neural scene representation to model portraits in 3D, whose generation is guided by a parametric face model that supports explicit control. While the latent disentanglement can be further enhanced by contrasting images with partially different attributes, there still exists noticeable inconsistency in non-face areas, e.g., hair and background, when animating expressions. We solve this by proposing a volume blending strategy in which we form a composite output by blending the dynamic and static radiance fields, with the two parts segmented from the jointly learned semantic field. Our method outperforms prior art in extensive experiments, producing realistic portraits with vivid expression in natural lighting when viewed from free viewpoints. The proposed method also demonstrates generalization ability to real images as well as out-of-domain cartoon faces, showing great promise in real applications. Additional video results and code will be available on the project webpage.
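The volume blending strategy described above can be sketched for a single ray as follows. This is only a hedged illustration: it assumes the semantic field yields a per-sample mask in [0, 1] that linearly mixes the densities and colors of the two fields before standard emission-absorption rendering. The paper may blend differently, and `blend_fields` and `render_ray` are hypothetical names.

```python
import numpy as np

def blend_fields(sigma_dyn, rgb_dyn, sigma_sta, rgb_sta, mask):
    """Per-sample blend: mask=1 selects the dynamic field, mask=0 the static one."""
    sigma = mask * sigma_dyn + (1.0 - mask) * sigma_sta
    rgb = mask[:, None] * rgb_dyn + (1.0 - mask[:, None]) * rgb_sta
    return sigma, rgb

def render_ray(sigma, rgb, deltas):
    """Standard emission-absorption volume rendering along one ray."""
    alpha = 1.0 - np.exp(-sigma * deltas)                          # opacity per sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance so far
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)

rng = np.random.default_rng(0)
n = 8                                                   # samples along the ray
sigma_dyn, sigma_sta = rng.random(n) * 5, rng.random(n) * 5
rgb_dyn, rgb_sta = rng.random((n, 3)), rng.random((n, 3))
deltas = np.full(n, 0.1)                                # spacing between samples
mask = rng.random(n)                                    # stand-in for the semantic field

sigma, rgb = blend_fields(sigma_dyn, rgb_dyn, sigma_sta, rgb_sta, mask)
pixel = render_ray(sigma, rgb, deltas)
print(pixel)
```

With a mask of all ones the composite reduces to the dynamic field alone, and with all zeros to the static field, which is the behaviour one would want for face versus hair and background regions.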

URL

https://arxiv.org/abs/2209.05434

PDF

https://arxiv.org/pdf/2209.05434.pdf
Injecting 3D Perception of Controllable NeRF-GAN into StyleGAN for Editable Portrait Image Synthesis

2022-07-21 01:41:54

Jeong-gi Kwak, Yuanming Li, Dongsik Yoon, Donghyeon Kim, David Han, Hanseok Ko

arXiv_CV

arXiv_CV GAN Image_Generation Portrait_Generation Unsupervised Pose 3D
Abstract

Over the years, 2D GANs have achieved great success in photorealistic portrait generation. However, they lack 3D understanding in the generation process and thus suffer from multi-view inconsistency. To alleviate the issue, many 3D-aware GANs have been proposed and have shown notable results, but 3D GANs struggle with editing semantic attributes. The controllability and interpretability of 3D GANs have not been much explored. In this work, we propose two solutions to overcome these weaknesses of 2D GANs and 3D-aware GANs. We first introduce a novel 3D-aware GAN, SURF-GAN, which is capable of discovering semantic attributes during training and controlling them in an unsupervised manner. After that, we inject the prior of SURF-GAN into StyleGAN to obtain a high-fidelity 3D-controllable generator. Unlike existing latent-based methods allowing implicit pose control, the proposed 3D-controllable StyleGAN enables explicit pose control over portrait generation. This distillation allows direct compatibility between 3D control and many StyleGAN-based techniques (e.g., inversion and stylization), and also brings an advantage in terms of computational resources. Our code is publicly available.

URL

https://arxiv.org/abs/2207.10257

PDF

https://arxiv.org/pdf/2207.10257.pdf
Evolutionary latent space search for driving human portrait generation

2022-04-25 18:00:49

Benjamín Machín, Sergio Nesmachnow, Jamal Toutouh

arXiv_AI

arXiv_AI GAN Recognition Image_Generation Portrait_Generation Adversarial Face Face_Recognition Pose
Abstract

This article presents an evolutionary approach for synthetic human portrait generation based on latent space exploration of a generative adversarial network. The idea is to produce different human face images that are very similar to a given target portrait. The approach applies StyleGAN2 for portrait generation and FaceNet for face similarity evaluation. The evolutionary search is based on exploring the real-coded latent space of StyleGAN2. The main results over both synthetic and real images indicate that the proposed approach generates accurate and diverse solutions, which represent realistic human portraits. The proposed research can contribute to improving the security of face recognition systems.
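
A minimal sketch of an evolutionary search over a real-coded latent space is given here for illustration. It does not reproduce the paper's algorithm or its operators: `toy_generator` and `toy_embed` are hypothetical stand-ins for StyleGAN2 and FaceNet, the latent dimension is tiny, and the elitism, uniform crossover and Gaussian mutation are assumed choices.

```python
import math
import random

LATENT_DIM = 8  # hypothetical; real StyleGAN2 latents are much larger

def toy_generator(z):
    """Stand-in for StyleGAN2: maps a latent vector to an 'image' (a vector)."""
    return [math.tanh(v) for v in z]

def toy_embed(image):
    """Stand-in for FaceNet: maps an image to an embedding (identity here)."""
    return list(image)

def fitness(z, target_embedding):
    """Negative Euclidean distance to the target; higher means more similar."""
    return -math.dist(toy_embed(toy_generator(z)), target_embedding)

def evolve(target_embedding, pop_size=30, generations=60,
           sigma=0.3, elite=5, seed=0):
    """Elitist search: keep the best latents, refill by crossover and mutation."""
    rng = random.Random(seed)
    population = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
                  for _ in range(pop_size)]
    history = []  # best fitness per generation
    for _ in range(generations):
        population.sort(key=lambda z: fitness(z, target_embedding), reverse=True)
        history.append(fitness(population[0], target_embedding))
        parents = population[:elite]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation of every gene.
            child = [(x if rng.random() < 0.5 else y) + rng.gauss(0.0, sigma)
                     for x, y in zip(a, b)]
            children.append(child)
        population = parents + children
    best = max(population, key=lambda z: fitness(z, target_embedding))
    return best, history

# The target embedding comes from a known latent so the toy problem is solvable.
target = toy_embed(toy_generator([0.5] * LATENT_DIM))
best_latent, history = evolve(target)
```

Because the elite latents are carried over unchanged, the best fitness never decreases from one generation to the next. Returning several top latents instead of one would be a simple way to obtain the diverse set of similar portraits that the abstract describes.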

URL

https://arxiv.org/abs/2204.11887

PDF

https://arxiv.org/pdf/2204.11887.pdf
Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer

2022-03-24 17:57:11

Shuai Yang, Liming Jiang, Ziwei Liu, Chen Change Loy

arXiv_CV

arXiv_CV GAN Style_Transfer Image_Generation Portrait_Generation Face Transfer_Learning
Abstract

Recent studies on StyleGAN show high performance on artistic portrait generation by transfer learning with limited data. In this paper, we explore more challenging exemplar-based high-resolution portrait style transfer by introducing a novel DualStyleGAN with flexible control of dual styles of the original face domain and the extended artistic portrait domain. Different from StyleGAN, DualStyleGAN provides a natural way of style transfer by characterizing the content and style of a portrait with an intrinsic style path and a new extrinsic style path, respectively. The delicately designed extrinsic style path enables our model to modulate both the color and complex structural styles hierarchically to precisely pastiche the style example. Furthermore, a novel progressive fine-tuning scheme is introduced to smoothly transform the generative space of the model to the target domain, even with the above modifications on the network architecture. Experiments demonstrate the superiority of DualStyleGAN over state-of-the-art methods in high-quality portrait style transfer and flexible style control.

URL

https://arxiv.org/abs/2203.13248

PDF

https://arxiv.org/pdf/2203.13248.pdf
DrawingInStyles: Portrait Image Generation and Editing with Spatially Conditioned StyleGAN

2022-03-05 14:54:07

Wanchao Su, Hui Ye, Shu-Yu Chen, Lin Gao, Hongbo Fu

arXiv_CV

arXiv_CV GAN Image_Generation Deep_Learning Portrait_Generation Face Quantitative Pose Sketch
Abstract

Research on sketch-to-portrait generation has progressed rapidly with deep learning techniques. The recently proposed StyleGAN architectures achieve state-of-the-art generation ability, but the original StyleGAN is not well suited to sketch-based creation due to its unconditional generation nature. To address this issue, we propose a direct conditioning strategy to better preserve the spatial information under the StyleGAN framework. Specifically, we introduce Spatially Conditioned StyleGAN (SC-StyleGAN for short), which explicitly injects spatial constraints into the original StyleGAN generation process. We explore two input modalities, sketches and semantic maps, which together allow users to express desired generation results more precisely and easily. Based on SC-StyleGAN, we present DrawingInStyles, a novel drawing interface for non-professional users to easily produce high-quality, photo-realistic face images with precise control, either from scratch or by editing existing ones. Qualitative and quantitative evaluations show the superior generation ability of our method over existing and alternative solutions. The usability and expressiveness of our system are confirmed by a user study.

URL

https://arxiv.org/abs/2203.02762

PDF

https://arxiv.org/pdf/2203.02762.pdf
Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation

2022-01-19 18:54:41

Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, Bolei Zhou

arXiv_SD

arXiv_SD Image_Generation Portrait_Generation Relation Pose Speech
Abstract

Animating high-fidelity video portraits with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation. In order to capture the inconsistent motions as well as the semantic difference between the human head and torso, some works model them via two individual sets of NeRF, leading to unnatural results. In this work, we propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF. The proposed model can handle the detailed local facial semantics and the global head-torso relationship through two semantic-aware modules. Specifically, we first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering. Moreover, to enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions. Extensive evaluations demonstrate that our proposed approach renders more realistic video portraits compared to previous methods. Project page: this https URL

URL

https://arxiv.org/abs/2201.07786

PDF

https://arxiv.org/pdf/2201.07786.pdf
MUSE: Illustrating Textual Attributes by Portrait Generation

2020-11-09 21:05:21

Xiaodan Hu, Pengfei Yu, Kevin Knight, Heng Ji, Bo Li, Honghui Shi

arXiv_CV

arXiv_CV Image_Generation Portrait_Generation Pose Emotion Reconstruction
Abstract

We propose a novel approach, MUSE, to illustrate textual attributes visually via portrait generation. MUSE takes as input a set of attributes written in text, in addition to facial features extracted from a photo of the subject. We propose 11 attribute types to represent inspirations from a subject's profile, emotion, story, and environment. We propose a novel stacked neural network architecture by extending an image-to-image generative model to accept textual attributes. Experiments show that our approach significantly outperforms several state-of-the-art methods that do not use textual attributes, with the Inception Score increased by 6% and the Fréchet Inception Distance (FID) decreased by 11%. We also propose a new attribute reconstruction metric to evaluate whether the generated portraits preserve the subject's attributes. Experiments show that our approach can accurately illustrate 78% of the textual attributes, which also helps MUSE capture the subject in a more creative and expressive way.

URL

https://arxiv.org/abs/2011.04761

PDF

https://arxiv.org/pdf/2011.04761.pdf
