Paper Reading AI Learner

HDBN: A Novel Hybrid Dual-branch Network for Robust Skeleton-based Action Recognition

2024-04-24 08:11:50
Jinfu Liu, Baiqiao Yin, Jiaying Lin, Jiajun Wen, Yue Li, Mengyuan Liu

Abstract

Skeleton-based action recognition has gained considerable traction thanks to its use of succinct and robust skeletal representations. Nonetheless, current methods often rely on a single backbone to model the skeleton modality, which can be limited by inherent flaws in that backbone. To address this and fully leverage the complementary strengths of different network architectures, we propose a novel Hybrid Dual-Branch Network (HDBN) for robust skeleton-based action recognition, which benefits from the graph convolutional network's proficiency in handling graph-structured data and the Transformer's powerful modeling of global information. In detail, our proposed HDBN comprises two trunk branches, MixGCN and MixFormer, which use GCNs and Transformers, respectively, to model both 2D and 3D skeletal modalities. Our proposed HDBN emerged as one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of the 2024 ICME Grand Challenge, achieving accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset and outperforming most existing methods. Our code will be publicly available at: this https URL.
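The dual-branch design described above ultimately combines the predictions of the GCN-based and Transformer-based branches. A minimal sketch of that late-fusion idea is shown below; the function names, the weighted-sum fusion, and the `alpha` parameter are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of dual-branch late fusion (illustrative only):
# each branch (a MixGCN-style GCN or a MixFormer-style Transformer)
# produces per-class scores for a skeleton sequence, and the final
# prediction is taken from a weighted sum of the two score vectors.

def fuse_scores(gcn_scores, former_scores, alpha=0.5):
    """Weighted late fusion of the two branches' class scores."""
    assert len(gcn_scores) == len(former_scores)
    return [alpha * g + (1.0 - alpha) * f
            for g, f in zip(gcn_scores, former_scores)]

def predict(fused_scores):
    """Index of the highest fused score = predicted action class."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)

# Toy example: the GCN branch favors class 1, the Transformer branch
# favors class 2; equal weighting settles on class 1.
gcn = [0.1, 0.7, 0.2]
former = [0.2, 0.2, 0.6]
print(predict(fuse_scores(gcn, former)))
```

The fusion weight would in practice be tuned on a validation split, and the same scheme extends to fusing more than two streams (e.g. separate 2D and 3D skeletal inputs per branch).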

URL

https://arxiv.org/abs/2404.15719

PDF

https://arxiv.org/pdf/2404.15719.pdf

