Abstract
This paper presents the first-place solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at the International Conference on Pattern Recognition (ICPR) 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes collected from multi-modal sources. The proposed approach is built upon the Temporal Shift Module (TSM), a technique for efficiently capturing temporal dynamics in video data, and incorporates multiple input modalities. Our strategy used transfer learning to leverage pre-trained models, followed by careful fine-tuning on the challenge's dataset to optimize performance for the 20 action classes. We selected a backbone network that balances computational efficiency and recognition accuracy, and further refined the model with an ensemble that integrates outputs from the different modalities. This ensemble proved crucial in boosting overall performance. Our solution achieved perfect top-1 accuracy on the test set, demonstrating the effectiveness of the approach in recognizing human actions across all 20 classes. Our code is available online at this https URL.
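For readers unfamiliar with TSM, the core idea is a zero-parameter temporal shift: a fraction of the channels in each frame's feature map is swapped with the neighboring frames before the usual 2D convolutions, letting a 2D backbone exchange information across time. Below is a minimal PyTorch sketch of that shift operation, following the original TSM paper (Lin et al., ICCV 2019); the tensor layout and the 1/8 shift fraction are the paper's defaults, not details confirmed by this solution's abstract.

```python
# Illustrative sketch of the TSM temporal shift, not the authors' released code.
import torch

def temporal_shift(x: torch.Tensor, n_segments: int, fold_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along the temporal axis.

    x: features of shape (batch * n_segments, channels, height, width),
       where n_segments is the number of sampled frames per clip.
    """
    nt, c, h, w = x.shape
    n_batch = nt // n_segments
    x = x.view(n_batch, n_segments, c, h, w)

    fold = c // fold_div  # 1/fold_div of the channels shift in each direction
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift one fold backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift one fold forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay in place
    return out.view(nt, c, h, w)
```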
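The abstract also credits an ensemble over modality outputs for much of the final performance. The sketch below shows one common way such late fusion is done, weighted averaging of per-modality softmax scores; the actual fusion rule and weights used by the authors are not specified in the abstract, so treat this as an assumption for illustration only.

```python
# A minimal late-fusion sketch: combine per-class scores from modality-specific
# models. The averaging rule and weights here are illustrative assumptions.
import torch

def ensemble_predict(modality_logits: list[torch.Tensor],
                     weights: list[float] | None = None) -> torch.Tensor:
    """Average softmax scores across modalities and return class predictions.

    modality_logits: one (num_samples, num_classes) tensor per modality,
                     e.g. num_classes = 20 for this challenge.
    """
    if weights is None:
        weights = [1.0 / len(modality_logits)] * len(modality_logits)
    probs = sum(w * torch.softmax(logits, dim=-1)
                for w, logits in zip(weights, modality_logits))
    return probs.argmax(dim=-1)  # predicted class index per sample
```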
URL
https://arxiv.org/abs/2501.17550