Abstract
Expressive Human Pose and Shape Estimation (EHPS) aims to jointly estimate human pose, hand gesture, and facial expression from monocular images. Existing methods predominantly rely on Transformer-based architectures, whose self-attention has quadratic complexity and incurs substantial computational overhead, especially in multi-person scenarios. Recently, Mamba has emerged as a promising alternative to Transformers due to its efficient global modeling capability. However, it remains limited in capturing fine-grained local dependencies, which are essential for precise EHPS. To address these issues, we propose EMO-X, an Efficient Multi-person One-stage model for EHPS. Specifically, we explore a Scan-based Global-Local Decoder (SGLD) that integrates global context with skeleton-aware local features to iteratively enhance human tokens. Our EMO-X leverages the superior global modeling capability of Mamba and designs a local bidirectional scan mechanism for skeleton-aware local refinement. Comprehensive experiments demonstrate that EMO-X strikes an excellent balance between efficiency and accuracy. Notably, it achieves a significant reduction in computational complexity, requiring 69.8% less inference time than state-of-the-art (SOTA) methods, while outperforming most of them in accuracy.
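The abstract mentions a local bidirectional scan over skeleton-aware features. The paper's actual decoder is not specified here, but the general idea of a bidirectional scan along a skeleton-ordered token sequence can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the fixed `decay` gate (a stand-in for Mamba's learned selective state updates), the joint ordering, and the averaging fusion are all assumptions for exposition.

```python
import numpy as np

def local_bidirectional_scan(joint_tokens, decay=0.9):
    """Hypothetical sketch of a skeleton-aware bidirectional scan.

    joint_tokens: (J, D) array of per-joint features, ordered along a
    skeleton traversal (e.g. pelvis -> spine -> head -> limbs).
    A forward and a backward exponentially-decayed scan each aggregate
    context along the kinematic chain; the two passes are averaged so
    every joint sees both its ancestors and descendants.
    """
    J, D = joint_tokens.shape
    fwd = np.zeros_like(joint_tokens)
    bwd = np.zeros_like(joint_tokens)

    state = np.zeros(D)
    for j in range(J):              # forward pass along the skeleton order
        state = decay * state + joint_tokens[j]
        fwd[j] = state

    state = np.zeros(D)
    for j in reversed(range(J)):    # backward pass, reversed order
        state = decay * state + joint_tokens[j]
        bwd[j] = state

    return 0.5 * (fwd + bwd)        # fuse both scan directions
```

In a real Mamba-style block the scalar `decay` would be replaced by input-dependent (selective) state-transition parameters, and the scan would run over learned projections rather than raw joint features; the sketch only conveys the bidirectional information flow.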
URL
https://arxiv.org/abs/2504.08718