Abstract
Multimodal action recognition methods have achieved great success using pose and RGB modalities. However, skeleton sequences lack appearance information, and RGB images suffer from irrelevant noise due to modality limitations. To address this, we introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of body parts while filtering out most irrelevant noise. We propose a new dual-branch framework called the Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to leverage both skeleton and human parsing modalities for action recognition. The human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the human parsing branch feeds depictive parsing feature maps into a convolutional backbone to model parsing features. The two high-level features are then effectively combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of our proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at: this https URL.
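The late fusion of the two branches can be sketched minimally as a weighted sum of per-class softmax scores. This is a hedged illustration, not EPP-Net's exact implementation: the branch logits below are hypothetical stand-ins for the GCN pose branch and CNN parsing branch outputs, and the fusion weight `alpha` is an assumed hyperparameter.

```python
import math

def softmax(logits):
    """Convert raw per-class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(pose_logits, parsing_logits, alpha=0.5):
    """Score-level fusion: weighted sum of the two branches' softmax outputs.

    alpha is a hypothetical fusion weight; the paper's actual weighting
    scheme may differ.
    """
    pose_p = softmax(pose_logits)
    parsing_p = softmax(parsing_logits)
    return [alpha * a + (1 - alpha) * b for a, b in zip(pose_p, parsing_p)]

# Toy example with 3 action classes; logits are illustrative only.
pose_logits = [2.0, 0.5, 0.1]     # stand-in for pose-branch (GCN) output
parsing_logits = [1.5, 1.0, 0.2]  # stand-in for parsing-branch (CNN) output
fused = late_fusion(pose_logits, parsing_logits)
pred = max(range(len(fused)), key=fused.__getitem__)  # predicted class index
```

Fusing at the score level keeps the two backbones independent, so each branch can be trained and swapped without retraining the other.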
URL
https://arxiv.org/abs/2401.02138