Improving Skeleton-based Action Recognition with Interactive Object Information

2025-01-09 08:43:09
Hao Wen, Ziqian Lu, Fengli Shen, Zhe-Ming Lu, Jialin Cui

Abstract

Human skeleton information is important in skeleton-based action recognition because it provides a simple and efficient way to describe human pose. However, existing skeleton-based methods focus on the skeleton and ignore the objects that humans interact with, which leads to poor performance on actions that involve object interactions. We propose a new action recognition framework that introduces object nodes to supply the missing interactive object information. We also propose Spatial Temporal Variable Graph Convolutional Networks (ST-VGCN) to effectively model the Variable Graph (VG) containing object nodes. Specifically, to validate the role of interactive object information, we use a simple self-training approach to establish a new dataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, which include more than 2 million additional object nodes. We also design a Variable Graph construction method that lets the graph structure accommodate a variable number of nodes. Additionally, we are the first to explore the overfitting issue introduced by incorporating additional object information, and we propose a VG-based data augmentation method, called Random Node Attack, to address it. Finally, regarding the network structure, we introduce two fusion modules, CAF and WNPool, along with a novel Node Balance Loss, to improve overall performance by effectively fusing and balancing skeleton and object node information. Our method surpasses the previous state of the art on multiple skeleton-based action recognition benchmarks. On NTU RGB+D 60, it reaches 96.7% accuracy on the cross-subject split and 99.2% on the cross-view split.
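The abstract does not spell out how the Variable Graph is built, so the following is only a minimal sketch of one way a per-frame graph could hold a variable number of object nodes next to a fixed skeleton. Everything specific here is an assumption for illustration and not the paper's construction: the function name `build_variable_graph`, the choice to link each object node to a given list of joints, and the zero-padding with a validity mask.

```python
from typing import List, Tuple


def build_variable_graph(
    skeleton_edges: List[Tuple[int, int]],
    num_joints: int,
    num_objects: int,
    attach_joints: List[int],
    max_nodes: int,
) -> Tuple[List[List[int]], List[int]]:
    """Build a padded adjacency matrix and node mask for one frame.

    Hypothetical helper: object nodes are appended after the skeleton
    joints and linked to `attach_joints` (an assumed rule, e.g. hands).
    """
    num_nodes = num_joints + num_objects
    if num_nodes > max_nodes:
        raise ValueError("max_nodes is too small for this frame")

    adj = [[0] * max_nodes for _ in range(max_nodes)]

    # Self-loops for real (non-padding) nodes only.
    for i in range(num_nodes):
        adj[i][i] = 1

    # Fixed skeleton bones, stored symmetrically.
    for a, b in skeleton_edges:
        adj[a][b] = adj[b][a] = 1

    # Variable part: each object node is linked to the chosen joints.
    for k in range(num_objects):
        obj = num_joints + k
        for j in attach_joints:
            adj[obj][j] = adj[j][obj] = 1

    # 1 marks a real node, 0 marks padding.
    mask = [1] * num_nodes + [0] * (max_nodes - num_nodes)
    return adj, mask


# Toy frame: a 3-joint chain plus one object attached to joint 2.
adj, mask = build_variable_graph(
    skeleton_edges=[(0, 1), (1, 2)],
    num_joints=3,
    num_objects=1,
    attach_joints=[2],
    max_nodes=5,
)
```

Padding to a fixed `max_nodes` with a mask is a common way to batch frames whose node counts differ; whether ST-VGCN does this is not stated in the abstract.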

URL

https://arxiv.org/abs/2501.05066

PDF

https://arxiv.org/pdf/2501.05066.pdf

