Abstract
Facial expressions and hand motions are necessary to express our emotions and interact with the world. Nevertheless, most 3D human avatars modeled from a casually captured video support only body motions, without facial expressions and hand motions. In this work, we present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video. We design ExAvatar as a combination of the whole-body parametric mesh model (SMPL-X) and 3D Gaussian Splatting (3DGS). The main challenges are 1) the limited diversity of facial expressions and poses in the video and 2) the absence of 3D observations, such as 3D scans and RGBD images. The limited diversity in the video makes animation with novel facial expressions and poses non-trivial. In addition, the absence of 3D observations causes significant ambiguity in human parts that are not observed in the video, which can result in noticeable artifacts under novel motions. To address these challenges, we introduce a hybrid representation of the mesh and 3D Gaussians. Our hybrid representation treats each 3D Gaussian as a vertex on the surface, with pre-defined connectivity information (i.e., triangle faces) between the Gaussians following the mesh topology of SMPL-X. This makes ExAvatar animatable with novel facial expressions, driven by the facial expression space of SMPL-X. In addition, connectivity-based regularizers significantly reduce artifacts under novel facial expressions and poses.
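The connectivity-based regularization described above can be understood as a smoothness penalty over the mesh edges that the 3D Gaussians share via the SMPL-X topology. Below is a minimal sketch of one such edge-based regularizer; the function name, the use of NumPy, and the choice of penalizing generic per-Gaussian attributes are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def connectivity_regularizer(attrs, faces):
    """Edge-based smoothness penalty (illustrative sketch, not the paper's exact loss).

    attrs: (N, D) array of per-Gaussian attributes (e.g. position offsets or scales),
           one row per Gaussian, where Gaussians sit on the SMPL-X mesh vertices.
    faces: (F, 3) integer array of triangle faces defining connectivity.
    Returns the mean squared difference of attributes across unique mesh edges.
    """
    # Collect the three edges of every triangle, then deduplicate undirected edges.
    edges = np.concatenate([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
    edges = np.unique(np.sort(edges, axis=1), axis=0)
    # Penalize attribute differences between the two endpoints of each edge.
    diff = attrs[edges[:, 0]] - attrs[edges[:, 1]]
    return float((diff ** 2).sum() / len(edges))
```

Because every Gaussian inherits a fixed vertex neighborhood from the SMPL-X topology, this kind of penalty ties unobserved parts to their observed neighbors, which is why it suppresses artifacts under novel poses.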
URL
https://arxiv.org/abs/2407.21686