A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition

2023-03-23 17:58:05
Andong Deng, Taojiannan Yang, Chen Chen

Abstract

The goal of building a benchmark (suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate the evolution of a specific area. Nonetheless, we point out that existing protocols of action recognition could yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observation suggests that current state-of-the-art cannot solidly guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark to gain insights on building next-generation spatiotemporal learners. Our dataset, code, and models are released at: this https URL

URL

https://arxiv.org/abs/2303.13505

PDF

https://arxiv.org/pdf/2303.13505.pdf

