Abstract
Inferring 3D object structure from a single image is an ill-posed task due to depth ambiguity and occlusion. Typical approaches in the literature either leverage 2D or 3D ground truth for supervised learning, or, in the unsupervised setting, impose hand-crafted symmetry priors or use implicit representations to hallucinate novel viewpoints. In this work, we propose a general adversarial learning framework for Unsupervised 2D to Explicit 3D Style Transfer (UE3DST). Specifically, we merge two architectures: the unsupervised explicit 3D reconstruction network of Wu et al. and the Generative Adversarial Network (GAN) StarGAN-v2. We experiment on three facial datasets (Basel Face Model, 3DFAW and CelebA-HQ) and show that our solution outperforms well-established baselines such as DepthNet in 3D reconstruction and Pix2NeRF in conditional style transfer, while ablation studies justify the individual contributions of our model components. In contrast to these baselines, our scheme produces features for explicit 3D rendering, which can be manipulated and utilized in downstream tasks.
URL
https://arxiv.org/abs/2304.12455