Action Recognition Using Temporal Shift Module and Ensemble Learning

2025-01-29 10:36:55
Anh-Kiet Duong, Petra Gomez-Krämer

Abstract

This paper presents the first-place solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at the International Conference on Pattern Recognition (ICPR) 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes collected from multi-modal sources. The proposed approach builds on the Temporal Shift Module (TSM), a technique for efficiently capturing temporal dynamics in video data, and incorporates multiple input data types. Our strategy used transfer learning to leverage pre-trained models, followed by careful fine-tuning on the challenge dataset to optimize performance on the 20 action classes. We selected a backbone network that balances computational efficiency and recognition accuracy, and further refined the model with an ensemble technique that integrates outputs from the different modalities. This ensemble approach proved crucial in boosting overall performance. Our solution achieved perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across the 20 classes. Our code is available online at this https URL.
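
To make the two ideas named in the abstract concrete, here is a minimal NumPy sketch of a temporal shift and of a late-fusion ensemble over modalities. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `fold_div=8` default, the `(T, C, H, W)` layout, and the use of weighted softmax averaging for fusion are all assumptions, since the abstract does not specify these details.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the time axis (zero-padded).

    x: array of shape (T, C, H, W), per-frame features of one clip.
    fold_div: assumed hyperparameter; C // fold_div channels are shifted
              in each temporal direction, the rest stay in place.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    # First group: frame t receives the features of frame t + 1.
    out[:-1, :fold] = x[1:, :fold]
    # Second group: frame t receives the features of frame t - 1.
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
    # Remaining channels are copied unchanged.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ensemble_predict(logits_per_modality, weights=None):
    """Fuse per-modality class scores by weighted averaging (assumed scheme).

    logits_per_modality: list of arrays, each of shape (N, K) for
                         N clips and K classes, one array per modality.
    weights: optional per-modality weights; uniform if omitted.
    Returns (predicted class indices of shape (N,), fused probabilities (N, K)).
    """
    probs = np.stack([softmax(l) for l in logits_per_modality])  # (M, N, K)
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.tensordot(weights, probs, axes=1)  # (N, K)
    return fused.argmax(axis=-1), fused
```

In a full model the shift would sit inside the residual blocks of a 2D backbone, so that each frame's convolution sees features from its neighbours at no extra multiply-add cost; the fusion step would take one score array per modality-specific model.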

URL

https://arxiv.org/abs/2501.17550

PDF

https://arxiv.org/pdf/2501.17550.pdf

