Paper Reading AI Learner

Learning Temporal Pose Estimation from Sparsely-Labeled Videos

2019-06-06 21:24:52
Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

Abstract

Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pair of video frames---a labeled Frame A and an unlabeled Frame B---we train our model to predict human pose in Frame A using the features from Frame B by means of deformable convolutions to implicitly learn the pose warping between A and B. We demonstrate that we can leverage our trained PoseWarper for several applications. First, at inference time we can reverse the application direction of our network in order to propagate pose information from manually annotated frames to unlabeled frames. This makes it possible to generate pose annotations for the entire video given only a few manually-labeled frames. Compared to modern label propagation methods based on optical flow, our warping mechanism is much more compact (6M vs 39M parameters), and also more accurate (88.7% mAP vs 83.8% mAP). We also show that we can improve the accuracy of a pose estimator by training it on an augmented dataset obtained by adding our propagated poses to the original manual labels. Lastly, we can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference. This allows our system to achieve state-of-the-art pose detection results on the PoseTrack2017 dataset.
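The sparse-annotation setup described above can be sketched in plain Python: a hypothetical `propagate_labels` helper assigns each unlabeled frame a pseudo-pose warped from its nearest manually labeled frame. The warp itself is passed in as a callable stub here; in the paper it is a learned module built on deformable convolutions, not the identity function used in this toy example.

```python
# Sketch of sparse-to-dense label propagation (hypothetical helper names;
# the actual warping is learned by PoseWarper's deformable convolutions).

def nearest_labeled_frame(t, k, num_frames):
    """Index of the closest manually labeled frame (labels every k frames)."""
    candidates = range(0, num_frames, k)
    return min(candidates, key=lambda s: abs(s - t))

def propagate_labels(manual_poses, k, num_frames, warp):
    """Build dense pseudo-labels from sparse manual annotations.

    manual_poses: {frame_index: pose} for frames 0, k, 2k, ...
    warp: callable(pose, src_frame, dst_frame) -> pose; stands in for the
          learned pose-warping module.
    """
    dense = {}
    for t in range(num_frames):
        s = nearest_labeled_frame(t, k, num_frames)
        # Labeled frames keep their manual pose; others get a warped copy.
        dense[t] = manual_poses[s] if s == t else warp(manual_poses[s], s, t)
    return dense

# Toy usage: identity warp, labels every k=3 frames, 7 frames total.
manual = {0: "pose@0", 3: "pose@3", 6: "pose@6"}
dense = propagate_labels(manual, k=3, num_frames=7, warp=lambda p, s, t: p)
```

At inference time the paper runs this propagation in reverse relative to training: the network was trained to predict the pose in the labeled frame from an unlabeled frame's features, and is then used to carry pose labels outward to the unlabeled frames.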


URL

https://arxiv.org/abs/1906.04016

PDF

https://arxiv.org/pdf/1906.04016.pdf

