A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition

Abstract

The goal of building a benchmark (a suite of datasets) is to provide a unified protocol for fair evaluation and thereby facilitate progress in a specific area. Nonetheless, we point out that existing action recognition protocols can yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), covering a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that current state-of-the-art models cannot reliably guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark for gaining insights into building next-generation spatiotemporal learners. Our dataset, code, and models are publicly released.

URL

https://arxiv.org/abs/2303.13505

PDF

https://arxiv.org/pdf/2303.13505.pdf