Paper Reading AI Learner

Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

2023-03-15 09:29:03
Runyang Feng, Yixing Gao, Xueqing Ma, Tze Ho Elden Tse, Hyung Jin Chang

Abstract

Temporal modeling is crucial for multi-frame human pose estimation. Most existing methods directly employ optical flow or deformable convolution to predict full-spectrum motion fields, which may incur numerous irrelevant cues, such as a nearby person or the background. Without further efforts to excavate meaningful motion priors, their results are suboptimal, especially under complicated spatiotemporal interactions. The temporal difference, on the other hand, can encode representative motion information that is potentially valuable for pose estimation but remains under-exploited. In this paper, we present a novel multi-frame human pose estimation framework that employs temporal differences across frames to model dynamic contexts and engages a mutual information objective to disentangle useful motion information. Specifically, we design a multi-stage Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage feature difference sequences to derive an informative motion representation. We further propose a Representation Disentanglement module, formulated from the mutual information perspective, which captures discriminative task-relevant motion signals by explicitly defining the useful and noisy constituents of the raw motion features and minimizing their mutual information. This framework ranks No.1 in the Crowd Pose Estimation in Complex Events Challenge on the HiEve benchmark and achieves state-of-the-art performance on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.
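A minimal sketch of the two ingredients the abstract describes: temporal differencing of per-frame feature maps as a motion cue, and a mutual-information estimate of the kind a disentanglement objective would drive toward zero between useful and noisy feature parts. The function names, the NumPy setting, and the histogram MI estimator are illustrative assumptions; the paper's actual encoder and (neural, differentiable) MI bound are not reproduced here.

```python
import numpy as np

def temporal_difference_features(frames: np.ndarray) -> np.ndarray:
    """Differences of consecutive per-frame feature maps.

    frames: (T, C, H, W) features from a backbone. The returned
    (T-1, C, H, W) difference sequence encodes frame-to-frame motion,
    while static content (background, nearby still persons) cancels out.
    """
    return frames[1:] - frames[:-1]

def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 16) -> float:
    """Plug-in histogram estimate of I(X; Y) in nats for two 1-D channels.

    A disentanglement objective like the one described above would
    minimize such a quantity between the "useful" and "noisy" motion
    constituents (in practice via a differentiable variational bound,
    not a histogram).
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

As a sanity check, a feature sequence with constant per-frame increments yields a constant difference field, and two independent channels give a near-zero MI estimate while a channel paired with itself gives a large one.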

URL

https://arxiv.org/abs/2303.08475

PDF

https://arxiv.org/pdf/2303.08475.pdf
