Abstract
Human skeleton information is important in skeleton-based action recognition, as it provides a simple and efficient way to describe human pose. However, existing skeleton-based methods focus on the skeleton alone and ignore the objects humans interact with, which leads to poor performance on actions that involve object interactions. We propose a new action recognition framework that introduces object nodes to supplement the missing information about interactive objects, together with Spatial Temporal Variable Graph Convolutional Networks (ST-VGCN) to effectively model the resulting Variable Graph (VG) containing object nodes. Specifically, to validate the role of interactive object information, we use a simple self-training approach to build a new dataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, which together include more than 2 million additional object nodes. We also design a Variable Graph construction method that accommodates a variable number of nodes in the graph structure. In addition, we are the first to explore the overfitting introduced by incorporating additional object information, and we propose a VG-based data augmentation method, called Random Node Attack, to address it. Finally, regarding the network structure, we introduce two fusion modules, CAF (Cross Attention Fusion) and WNPool (Weighted Neighbor Pooling), along with a novel Node Balance Loss, which improve overall performance by effectively fusing and balancing skeleton and object node information. Our method surpasses the previous state of the art on multiple skeleton-based action recognition benchmarks, reaching 96.7% accuracy on the NTU RGB+D 60 cross-subject split and 99.2% on the cross-view split.
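For intuition only, the sketch below shows one way a Variable Graph with appended object nodes and a Random Node Attack style augmentation could be represented. Every name and detail here (build_variable_graph, random_node_attack, MAX_OBJ_NODES, the NumPy layout, and the noise-based perturbation) is an assumption made for illustration; the abstract does not specify the implementation.

```python
# Minimal, hypothetical sketch: skeleton joints plus a variable number of
# object nodes are packed into a fixed-size tensor with a validity mask,
# and object nodes are randomly perturbed as a form of data augmentation.
import numpy as np

NUM_JOINTS = 25          # NTU RGB+D skeleton joints
MAX_OBJ_NODES = 4        # assumed cap on object nodes per frame

def build_variable_graph(joints, obj_nodes):
    """Concatenate skeleton joints with a variable number of object nodes.

    joints:    (T, NUM_JOINTS, 3) array of 3D joint coordinates
    obj_nodes: list of per-frame arrays, each (k_t, 3) with 0 <= k_t <= MAX_OBJ_NODES
    Returns a padded node tensor (T, NUM_JOINTS + MAX_OBJ_NODES, 3) and a
    boolean mask marking which object slots hold real detections.
    """
    T = joints.shape[0]
    nodes = np.zeros((T, NUM_JOINTS + MAX_OBJ_NODES, 3), dtype=np.float32)
    mask = np.zeros((T, NUM_JOINTS + MAX_OBJ_NODES), dtype=bool)
    nodes[:, :NUM_JOINTS] = joints
    mask[:, :NUM_JOINTS] = True
    for t, objs in enumerate(obj_nodes):
        k = min(len(objs), MAX_OBJ_NODES)
        if k:
            nodes[t, NUM_JOINTS:NUM_JOINTS + k] = objs[:k]
            mask[t, NUM_JOINTS:NUM_JOINTS + k] = True
    return nodes, mask

def random_node_attack(nodes, mask, p=0.3, noise_scale=0.1, rng=None):
    """Hypothetical augmentation: with probability p, replace the object-node
    coordinates of a sample with random noise so the network cannot
    over-rely on object cues (one plausible reading of 'Random Node Attack')."""
    rng = rng or np.random.default_rng()
    nodes = nodes.copy()
    if rng.random() < p:
        obj_slots = mask[:, NUM_JOINTS:]                      # (T, MAX_OBJ_NODES)
        noise = rng.normal(0.0, noise_scale, size=nodes[:, NUM_JOINTS:].shape)
        nodes[:, NUM_JOINTS:] = np.where(obj_slots[..., None], noise, 0.0)
    return nodes
```

The padding-plus-mask layout is one simple way to let a graph convolutional network consume a variable number of nodes per sample without changing tensor shapes; the paper's actual Variable Graph construction may differ.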
URL
https://arxiv.org/abs/2501.05066