Abstract
Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high-fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or on monocular videos with per-identity optimization at inference time, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose \OURS, a feed-forward method that generates high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and the Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representation with per-Gaussian features and introduce a lightweight MLP-based dynamic network that predicts 3D Gaussian deformations from expression codes. Furthermore, to enhance the geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.
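The abstract's animation component (per-Gaussian features concatenated with an expression code and fed through a lightweight MLP to predict per-Gaussian deformations) can be sketched roughly as follows. All dimensions, layer sizes, and output parameterization here are assumptions for illustration; the paper's abstract does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GAUSSIANS = 4     # toy count (real avatars use tens of thousands of Gaussians)
FEAT_DIM = 8        # per-Gaussian feature dimension (assumed)
EXPR_DIM = 6        # expression-code dimension (assumed)
HIDDEN = 16         # hidden width of the lightweight MLP (assumed)
OUT_DIM = 3 + 4 + 3 # position delta, rotation (quaternion) delta, scale delta

# Hypothetical weights of a one-hidden-layer dynamic MLP.
W1 = rng.standard_normal((FEAT_DIM + EXPR_DIM, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, OUT_DIM)) * 0.1
b2 = np.zeros(OUT_DIM)

def predict_deformation(gauss_feats, expr_code):
    """Predict per-Gaussian deformation deltas from an expression code.

    gauss_feats: (N, FEAT_DIM) learned per-Gaussian features
    expr_code:   (EXPR_DIM,)   expression code driving the animation
    Returns a dict of (N, 3) position, (N, 4) rotation, (N, 3) scale deltas.
    """
    n = gauss_feats.shape[0]
    # Broadcast the shared expression code to every Gaussian, then concat.
    x = np.concatenate([gauss_feats, np.tile(expr_code, (n, 1))], axis=1)
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    out = h @ W2 + b2
    return {"d_pos": out[:, :3], "d_rot": out[:, 3:7], "d_scale": out[:, 7:]}

feats = rng.standard_normal((N_GAUSSIANS, FEAT_DIM))
expr = rng.standard_normal(EXPR_DIM)
deltas = predict_deformation(feats, expr)
print(deltas["d_pos"].shape)  # (4, 3)
```

Because the Gaussians stay explicit and only small deltas are predicted per frame, a forward pass of an MLP this size is cheap enough for real-time animation; the cost is one matrix multiply per layer over all Gaussians.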
URL
https://arxiv.org/abs/2601.13837