Abstract
Graph convolutional networks (GCNs) have emerged as a powerful tool for skeleton-based action and gesture recognition, thanks to their ability to model spatial and temporal dependencies in skeleton data. However, existing GCN-based methods face critical limitations: (1) they lack effective spatio-temporal topology modeling that captures dynamic variations in skeletal motion, and (2) they struggle to model multiscale structural relationships beyond local joint connectivity. To address these issues, we propose a novel framework called Dynamic Spatial-Temporal Semantic Awareness Graph Convolutional Network (DSTSA-GCN). DSTSA-GCN introduces three key modules: Group Channel-wise Graph Convolution (GC-GC), Group Temporal-wise Graph Convolution (GT-GC), and Multi-Scale Temporal Convolution (MS-TCN). GC-GC and GT-GC operate in parallel to independently model channel-specific and frame-specific correlations, enabling robust topology learning that accounts for temporal variations. Additionally, both modules employ a grouping strategy to adaptively capture multiscale structural relationships. Complementing this, MS-TCN enhances temporal modeling through group-wise temporal convolutions with diverse receptive fields. Extensive experiments demonstrate that DSTSA-GCN significantly improves the topology modeling capabilities of GCNs, achieving state-of-the-art performance on benchmark datasets for gesture and action recognition, including SHREC17 Track, DHG-14/28, NTU-RGB+D, and NTU-RGB+D-120.
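The paper's implementation is not reproduced here, but the core idea behind GC-GC, grouped and data-dependent topology learning, can be illustrated with a short sketch. The PyTorch snippet below splits the channels into groups and lets each group infer its own adjacency from temporally pooled joint embeddings; GT-GC would apply the same construction along frame groups instead of channel groups. All names and sizes (GroupChannelGraphConv, embed_q/embed_k, groups=4, the 8-dimensional embedding) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GroupChannelGraphConv(nn.Module):
    """Hypothetical sketch of a group channel-wise graph convolution:
    channels are split into G groups, and each group aggregates joints
    with its own (shared prior + input-dependent) adjacency matrix."""

    def __init__(self, in_channels, out_channels, num_joints, groups=4):
        super().__init__()
        assert out_channels % groups == 0
        self.groups = groups
        # One learnable adjacency prior per group; a real model would
        # initialize this from the skeleton's bone connectivity.
        self.A = nn.Parameter(torch.eye(num_joints).repeat(groups, 1, 1))
        self.embed_q = nn.Conv2d(in_channels, groups * 8, kernel_size=1)
        self.embed_k = nn.Conv2d(in_channels, groups * 8, kernel_size=1)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V) = batch, channels, frames, joints
        N, C, T, V = x.shape
        G = self.groups
        # Infer a data-dependent adjacency per channel group by comparing
        # temporally pooled joint embeddings (a common adaptive-topology trick).
        q = self.embed_q(x).mean(dim=2).view(N, G, -1, V)  # (N, G, 8, V)
        k = self.embed_k(x).mean(dim=2).view(N, G, -1, V)
        dyn = torch.softmax(torch.einsum('ngdu,ngdv->nguv', q, k), dim=-1)
        A = self.A.unsqueeze(0) + dyn                      # (N, G, V, V)
        # Split output channels into groups and aggregate over joints
        # with each group's adjacency.
        y = self.proj(x).view(N, G, -1, T, V)
        y = torch.einsum('ngctv,ngvw->ngctw', y, A)
        return y.reshape(N, -1, T, V)

# Toy usage: 25 joints as in NTU-RGB+D, 32 frames.
x = torch.randn(2, 64, 32, 25)
print(GroupChannelGraphConv(64, 128, num_joints=25)(x).shape)  # (2, 128, 32, 25)
```

The grouping is what gives each subset of channels its own view of the skeleton graph, which is one plausible reading of the abstract's "multiscale structural relationships".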
URL
https://arxiv.org/abs/2501.12086