DSTSA-GCN: Advancing Skeleton-Based Gesture Recognition with Semantic-Aware Spatio-Temporal Topology Modeling

2025-01-21 12:28:36
Hu Cui, Renjing Huang, Ruoyu Zhang, Tessai Hayama

Abstract

Graph convolutional networks (GCNs) have emerged as a powerful tool for skeleton-based action and gesture recognition, thanks to their ability to model spatial and temporal dependencies in skeleton data. However, existing GCN-based methods face critical limitations: (1) they lack effective spatio-temporal topology modeling that captures dynamic variations in skeletal motion, and (2) they struggle to model multiscale structural relationships beyond local joint connectivity. To address these issues, we propose a novel framework called Dynamic Spatial-Temporal Semantic Awareness Graph Convolutional Network (DSTSA-GCN). DSTSA-GCN introduces three key modules: Group Channel-wise Graph Convolution (GC-GC), Group Temporal-wise Graph Convolution (GT-GC), and Multi-Scale Temporal Convolution (MS-TCN). GC-GC and GT-GC operate in parallel to independently model channel-specific and frame-specific correlations, enabling robust topology learning that accounts for temporal variations. Additionally, both modules employ a grouping strategy to adaptively capture multiscale structural relationships. Complementing this, MS-TCN enhances temporal modeling through group-wise temporal convolutions with diverse receptive fields. Extensive experiments demonstrate that DSTSA-GCN significantly improves the topology modeling capabilities of GCNs, achieving state-of-the-art performance on benchmark datasets for gesture and action recognition, including SHREC17 Track, DHG-14/28, NTU-RGB+D, and NTU-RGB+D-120.
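
As a rough illustration of the "group-wise temporal convolutions with diverse receptive fields" idea attributed to MS-TCN above, the following minimal sketch splits the channels of one joint's feature sequence into groups and convolves each group over time with a different dilation. It is not the authors' implementation: the function names, the fixed (unlearned) kernel, and the choice of dilation as the way to vary the receptive field are all assumptions made for illustration.

```python
from typing import List

# Assumed layout for this sketch: features[channel][frame] for a single joint.
Features = List[List[float]]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of frames spanned by one dilated temporal kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_temporal_conv(signal: List[float], kernel: List[float],
                          dilation: int) -> List[float]:
    """1D convolution over time, zero-padded so output length equals input length.

    Assumes an odd kernel size so the kernel is centred on each frame.
    """
    frames = len(signal)
    half = (len(kernel) // 2) * dilation  # offset from the centre to the first tap
    out = []
    for t in range(frames):
        acc = 0.0
        for i, w in enumerate(kernel):
            src = t - half + i * dilation
            if 0 <= src < frames:  # taps outside the sequence act as zero padding
                acc += w * signal[src]
        out.append(acc)
    return out


def multi_scale_temporal_conv(x: Features, dilations: List[int],
                              kernel: List[float]) -> Features:
    """Split channels into len(dilations) groups; each group uses its own dilation.

    Hypothetical stand-in for a multi-scale temporal block: a real module
    would use learned weights per channel rather than one shared fixed kernel.
    """
    channels = len(x)
    groups = len(dilations)
    if groups == 0 or channels % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = channels // groups
    out: Features = []
    for g, dilation in enumerate(dilations):
        for c in range(g * per_group, (g + 1) * per_group):
            out.append(dilated_temporal_conv(x[c], kernel, dilation))
    return out  # groups are concatenated back along the channel axis


# Usage: 4 channels, 9 frames, a single impulse at frame 4 in every channel.
x = [[0.0] * 9 for _ in range(4)]
for channel in x:
    channel[4] = 1.0
y = multi_scale_temporal_conv(x, dilations=[1, 2], kernel=[1.0, 1.0, 1.0])
```

With the impulse input, the first group (dilation 1) spreads the response over three adjacent frames, while the second group (dilation 2) reaches two frames in each direction, so the concatenated output mixes short-range and longer-range temporal context.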

URL

https://arxiv.org/abs/2501.12086

PDF

https://arxiv.org/pdf/2501.12086.pdf

