Paper Reading AI Learner

ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

2024-04-09 12:09:56
Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan

Abstract

Human action or activity recognition in videos is a fundamental task in computer vision, with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction, and more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using cross-architecture pseudo-labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are used to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while video transformers excel at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each architecture. Experimental results on standard action recognition datasets demonstrate that our approach outperforms existing methods, achieving state-of-the-art performance with only a fraction of the labeled data. The official website of this work is available at: this https URL.
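The abstract names two core mechanisms: confidence-thresholded pseudo-labeling between the two branches, and contrastive learning over paired clip embeddings. As a rough illustrative sketch (not the paper's actual implementation: the function names, the 0.8 confidence threshold, and the plain-NumPy InfoNCE form are all assumptions), the two mechanisms might look like:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_labels(logits, threshold=0.8):
    """Hard pseudo-labels from one branch's class logits.

    Low-confidence samples are masked out (label -1) so they do not
    contribute to the other branch's supervised loss.
    """
    probs = softmax(logits)
    labels = probs.argmax(axis=-1)
    labels[probs.max(axis=-1) < threshold] = -1
    return labels

def info_nce(za, zb, temperature=0.1):
    """Symmetric-style InfoNCE loss between paired embeddings.

    Row i of `za` (e.g. the 3D CNN branch) and row i of `zb` (e.g. the
    video-transformer branch) are treated as a positive pair; all other
    rows in the batch serve as negatives.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    sim = za @ zb.T / temperature                      # (N, N) similarities
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(len(za))
    return -log_p[idx, idx].mean()                     # cross-entropy on diagonal
```

In this sketch, each architecture would generate pseudo-labels for the unlabeled clips it is confident about, which then supervise the other architecture, while the contrastive term pulls the two branches' embeddings of the same clip together.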


URL

https://arxiv.org/abs/2404.06243

PDF

https://arxiv.org/pdf/2404.06243.pdf

