Abstract
Existing Transformers for monocular 3D human shape and pose estimation typically incur computation and memory costs that grow quadratically with the feature length, which hinders the exploitation of fine-grained information in high-resolution features that benefits accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which together allow effective utilization of high-resolution features in the Transformer. Building on these two designs, we also introduce several novel modules, including a multi-scale attention module and a joint-aware attention module, to further boost reconstruction performance. Extensive experiments demonstrate, both quantitatively and qualitatively, the effectiveness of SMPLer against existing 3D human shape and pose estimation methods. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at this https URL.
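To make the complexity argument concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' code) contrasting standard self-attention over N flattened feature tokens, whose N x N attention map is quadratic in the feature length, with cross-attention from a small set of K SMPL-parameter query tokens, whose K x N map is linear in N. The token counts, dimensions, and variable names are illustrative assumptions; the paper's actual decoupled attention design may differ.

```python
import torch
import torch.nn.functional as F

N = 64 * 64   # high-resolution feature length (assumed 64x64 feature map)
K = 24 + 1    # compact SMPL-based queries (e.g., 24 joint rotations + shape)
D = 256       # channel dimension (assumed)

feats = torch.randn(1, N, D)    # flattened high-resolution image features
queries = torch.randn(1, K, D)  # compact SMPL-parameter query tokens

# Standard self-attention: the (N x N) attention map costs O(N^2)
# time and memory, which becomes prohibitive for high-resolution features.
self_attn = F.softmax(feats @ feats.transpose(1, 2) / D ** 0.5, dim=-1) @ feats

# Cross-attention from K SMPL queries (K << N): the (K x N) map costs
# O(K * N), i.e., linear in the feature length, so fine-grained
# high-resolution features remain affordable to attend over.
cross_attn = F.softmax(queries @ feats.transpose(1, 2) / D ** 0.5, dim=-1) @ feats

print(self_attn.shape, cross_attn.shape)  # (1, N, D) vs. (1, K, D)
```

The sketch illustrates why anchoring attention on a compact SMPL-based target representation, rather than on per-pixel or per-vertex tokens, is what lets the Transformer exploit high-resolution features.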
URL
https://arxiv.org/abs/2404.15276