Paper Reading AI Learner

Multi-person Articulated Tracking with Spatial and Temporal Embeddings

2019-03-21 19:42:27
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian

Abstract

We propose a unified framework for multi-person pose estimation and tracking. Our framework consists of two main components, i.e., SpatialNet and TemporalNet. SpatialNet performs body part detection and part-level data association within a single frame, while TemporalNet groups human instances across consecutive frames into trajectories. Specifically, besides body part detection heatmaps, SpatialNet also predicts a Keypoint Embedding (KE) and a Spatial Instance Embedding (SIE) for body part association. We formulate the grouping procedure as a differentiable Pose-Guided Grouping (PGG) module, making the whole part detection and grouping pipeline fully end-to-end trainable. TemporalNet extends the spatial grouping of keypoints to the temporal grouping of human instances. Given human proposals from two consecutive frames, TemporalNet exploits both appearance features encoded in a Human Embedding (HE) and temporally consistent geometric features embodied in a Temporal Instance Embedding (TIE) for robust tracking. Extensive experiments demonstrate the effectiveness of the proposed model. Remarkably, our method improves over the state-of-the-art pose tracking method, raising Multi-Object Tracking Accuracy (MOTA) from 65.4% to 71.8% on the ICCV'17 PoseTrack dataset.
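
Since the abstract only sketches the architecture, the following is a minimal PyTorch-style sketch of how the two networks could fit together. All class names, layer choices, tensor shapes, and the greedy cosine-similarity matching are illustrative assumptions, not the authors' implementation; in particular, the differentiable PGG module is omitted and simple greedy matching stands in for the association steps.

    # Minimal sketch of the SpatialNet -> TemporalNet pipeline, assuming PyTorch.
    # All shapes, heads, and the greedy matching below are illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialNet(nn.Module):
        # Per-frame network: part detection heatmaps, a Keypoint Embedding (KE)
        # map for part-level association, and a Spatial Instance Embedding (SIE).
        def __init__(self, in_ch=3, feat=64, num_parts=15, embed=32):
            super().__init__()
            self.backbone = nn.Sequential(nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU())
            self.heatmaps = nn.Conv2d(feat, num_parts, 1)  # body part detection
            self.ke = nn.Conv2d(feat, embed, 1)            # keypoint embedding map
            self.sie = nn.Conv2d(feat, 2, 1)               # e.g. offsets toward person centers

        def forward(self, frame):
            f = self.backbone(frame)
            return self.heatmaps(f), self.ke(f), self.sie(f)

    class TemporalNet(nn.Module):
        # Embeds person proposals by combining an appearance Human Embedding (HE)
        # with a geometric Temporal Instance Embedding (TIE).
        def __init__(self, feat=32, embed=32):
            super().__init__()
            self.he = nn.Linear(feat, embed)  # appearance branch (placeholder)
            self.tie = nn.Linear(2, embed)    # geometry branch (placeholder)

        def forward(self, appearance, geometry):
            # appearance: (N, feat) pooled person features; geometry: (N, 2), e.g. centers
            return F.normalize(self.he(appearance) + self.tie(geometry), dim=-1)

    def match_across_frames(emb_prev, emb_curr, thresh=0.5):
        # Greedy cosine-similarity matching of person embeddings between frame
        # t-1 and frame t; a simple stand-in for the paper's association step.
        sim = emb_prev @ emb_curr.t()  # (N_prev, N_curr), embeddings are unit-norm
        matches = []
        while sim.numel() and sim.max() > thresh:
            i, j = divmod(int(sim.argmax()), sim.size(1))
            matches.append((i, j))     # proposal i in frame t-1 tracks to j in frame t
            sim[i, :] = -1.0
            sim[:, j] = -1.0
        return matches

In the actual paper, the per-frame grouping is itself the differentiable PGG module trained end-to-end with the detector; the sketch above only illustrates the data flow between the two stages.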

URL

https://arxiv.org/abs/1903.09214

PDF

https://arxiv.org/pdf/1903.09214.pdf

