Abstract
With the development of neural radiance fields and generative models, numerous methods have been proposed for learning 3D human generation from 2D images. These methods allow control over the pose of the generated 3D human and enable rendering from different viewpoints. However, none of these methods explores semantic disentanglement in human image synthesis, i.e., they cannot disentangle the generation of different semantic parts, such as the body, tops, and bottoms. Furthermore, existing methods are limited to synthesizing images at $512^2$ resolution due to the high computational cost of neural radiance fields. To address these limitations, we introduce SemanticHuman-HD, the first method to achieve semantically disentangled human image synthesis. Notably, SemanticHuman-HD is also the first method to achieve 3D-aware image synthesis at $1024^2$ resolution, benefiting from our proposed 3D-aware super-resolution module. By leveraging depth maps and semantic masks as guidance for the 3D-aware super-resolution, we significantly reduce the number of sampling points during volume rendering, thereby reducing the computational cost. Our comparative experiments demonstrate the superiority of our method. The effectiveness of each proposed component is also verified through ablation studies. Moreover, our method opens up exciting possibilities for various applications, including 3D garment generation, semantic-aware image synthesis, controllable image synthesis, and out-of-domain image synthesis.
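The core efficiency idea — using a low-resolution depth estimate to concentrate ray samples near the surface instead of sampling the full ray segment — can be illustrated with a minimal sketch. This is an assumption-laden toy (the function names, band width, and sample counts are hypothetical, not taken from the paper), intended only to show why depth guidance shrinks the per-ray sample budget:

```python
import numpy as np

def full_range_samples(near, far, n_samples):
    # Baseline NeRF-style sampling: points spread over the whole ray segment,
    # so many samples are needed to hit the surface densely.
    return np.linspace(near, far, n_samples)

def depth_guided_samples(depth, band=0.05, n_samples=8):
    # Hypothetical depth-guided variant: with a coarse depth estimate per ray,
    # samples can be concentrated in a narrow band around the surface,
    # cutting the number of network evaluations per ray.
    return np.linspace(depth - band, depth + band, n_samples)

# One ray with bounds [0.5, 2.5] and an estimated surface depth of 1.2:
baseline = full_range_samples(near=0.5, far=2.5, n_samples=64)
guided = depth_guided_samples(depth=1.2, band=0.05, n_samples=8)
print(len(baseline), len(guided))  # 64 8
```

At $1024^2$ resolution this per-ray reduction compounds over roughly a million rays, which is the kind of saving the super-resolution module's depth and semantic guidance targets.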
URL
https://arxiv.org/abs/2403.10166