Paper Reading AI Learner

Multi-person Articulated Tracking with Spatial and Temporal Embeddings

2019-03-21 19:42:27
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian

Abstract

We propose a unified framework for multi-person pose estimation and tracking. Our framework consists of two main components, i.e., SpatialNet and TemporalNet. SpatialNet performs body part detection and part-level data association within a single frame, while TemporalNet groups human instances across consecutive frames into trajectories. Specifically, besides body part detection heatmaps, SpatialNet also predicts a Keypoint Embedding (KE) and a Spatial Instance Embedding (SIE) for body part association. We formulate the grouping procedure as a differentiable Pose-Guided Grouping (PGG) module, making the whole part detection and grouping pipeline fully end-to-end trainable. TemporalNet extends the spatial grouping of keypoints to the temporal grouping of human instances. Given human proposals from two consecutive frames, TemporalNet exploits both appearance features encoded in a Human Embedding (HE) and temporally consistent geometric features embodied in a Temporal Instance Embedding (TIE) for robust tracking. Extensive experiments demonstrate the effectiveness of the proposed model. Remarkably, our method improves over the state-of-the-art pose tracking method, raising Multi-Object Tracking Accuracy (MOTA) from 65.4% to 71.8% on the ICCV'17 PoseTrack dataset.
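
Since the abstract only sketches the architecture, the following is a minimal PyTorch-style sketch of how the two networks could fit together. All class names, layer choices, tensor shapes, and the greedy cosine-similarity matching are illustrative assumptions, not the authors' implementation; in particular, the differentiable PGG module is omitted and simple greedy matching stands in for the association steps.

    # Minimal sketch of the SpatialNet -> TemporalNet pipeline, assuming PyTorch.
    # All shapes, heads, and the greedy matching below are illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialNet(nn.Module):
        # Per-frame network: part detection heatmaps, a Keypoint Embedding (KE)
        # map for part-level association, and a Spatial Instance Embedding (SIE).
        def __init__(self, in_ch=3, feat=64, num_parts=15, embed=32):
            super().__init__()
            self.backbone = nn.Sequential(nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU())
            self.heatmaps = nn.Conv2d(feat, num_parts, 1)  # body part detection
            self.ke = nn.Conv2d(feat, embed, 1)            # keypoint embedding map
            self.sie = nn.Conv2d(feat, 2, 1)               # e.g. offsets toward person centers

        def forward(self, frame):
            f = self.backbone(frame)
            return self.heatmaps(f), self.ke(f), self.sie(f)

    class TemporalNet(nn.Module):
        # Embeds person proposals by combining an appearance Human Embedding (HE)
        # with a geometric Temporal Instance Embedding (TIE).
        def __init__(self, feat=32, embed=32):
            super().__init__()
            self.he = nn.Linear(feat, embed)  # appearance branch (placeholder)
            self.tie = nn.Linear(2, embed)    # geometry branch (placeholder)

        def forward(self, appearance, geometry):
            # appearance: (N, feat) pooled person features; geometry: (N, 2), e.g. centers
            return F.normalize(self.he(appearance) + self.tie(geometry), dim=-1)

    def match_across_frames(emb_prev, emb_curr, thresh=0.5):
        # Greedy cosine-similarity matching of person embeddings between frame
        # t-1 and frame t; a simple stand-in for the paper's association step.
        sim = emb_prev @ emb_curr.t()  # (N_prev, N_curr), embeddings are unit-norm
        matches = []
        while sim.numel() and sim.max() > thresh:
            i, j = divmod(int(sim.argmax()), sim.size(1))
            matches.append((i, j))     # proposal i in frame t-1 tracks to j in frame t
            sim[i, :] = -1.0
            sim[:, j] = -1.0
        return matches

In the actual paper, the per-frame grouping is itself the differentiable PGG module trained end-to-end with the detector; the sketch above only illustrates the data flow between the two stages.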

URL

https://arxiv.org/abs/1903.09214

PDF

https://arxiv.org/pdf/1903.09214.pdf

