Paper Reading AI Learner

ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos

2024-04-09 12:09:56
Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C.-W. Phan

Abstract

Human action or activity recognition in videos is a fundamental task in computer vision, with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction, and more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using cross-architecture pseudo-labeling with contrastive learning for semi-supervised action recognition. Our framework leverages both labeled and unlabeled data to robustly learn action representations in videos, combining pseudo-labeling with contrastive learning for effective learning from both types of samples. We introduce a novel cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are used to capture different aspects of action representations; hence we call it ActNetFormer. The 3D CNNs excel at capturing spatial features and local dependencies in the temporal domain, while video transformers excel at capturing long-range dependencies across frames. By integrating these complementary architectures within the ActNetFormer framework, our approach can effectively capture both local and global contextual information of an action. This comprehensive representation learning enables the model to achieve better performance in semi-supervised action recognition tasks by leveraging the strengths of each architecture. Experimental results on standard action recognition datasets demonstrate that our approach outperforms existing methods, achieving state-of-the-art performance with only a fraction of the labeled data. The official website of this work is available at: this https URL.
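The abstract names two core mechanisms: confidence-thresholded pseudo-labeling between the two branches, and contrastive learning over paired clip embeddings. As a rough illustrative sketch (not the paper's actual implementation: the function names, the 0.8 confidence threshold, and the plain-NumPy InfoNCE form are all assumptions), the two mechanisms might look like:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_labels(logits, threshold=0.8):
    """Hard pseudo-labels from one branch's class logits.

    Low-confidence samples are masked out (label -1) so they do not
    contribute to the other branch's supervised loss.
    """
    probs = softmax(logits)
    labels = probs.argmax(axis=-1)
    labels[probs.max(axis=-1) < threshold] = -1
    return labels

def info_nce(za, zb, temperature=0.1):
    """Symmetric-style InfoNCE loss between paired embeddings.

    Row i of `za` (e.g. the 3D CNN branch) and row i of `zb` (e.g. the
    video-transformer branch) are treated as a positive pair; all other
    rows in the batch serve as negatives.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    sim = za @ zb.T / temperature                      # (N, N) similarities
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(len(za))
    return -log_p[idx, idx].mean()                     # cross-entropy on diagonal
```

In this sketch, each architecture would generate pseudo-labels for the unlabeled clips it is confident about, which then supervise the other architecture, while the contrastive term pulls the two branches' embeddings of the same clip together.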


URL

https://arxiv.org/abs/2404.06243

PDF

https://arxiv.org/pdf/2404.06243.pdf

