STMixer: A One-Stage Sparse Action Detector

Abstract

Traditional video action detectors typically adopt a two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and the feature sampling is constrained to the inside of the box, failing to effectively leverage the richer context information outside it. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, and thus suffer from inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, that is, STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tubelet detection.
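To make the idea of sampling features outside the actor box concrete, here is a minimal sketch in plain Python. It is not the STMixer implementation: it works on a single 2D feature map with one channel, and the function names `bilinear_sample` and `adaptive_sample`, the `(cx, cy, w, h)` box format, and the `offsets` argument are all assumptions made for illustration. In a query-based detector the offsets would be predicted from each query; here they are simply passed in.

```python
from math import floor


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate feat[row][col] at the continuous point (x, y).

    Points are clamped to the feature map, an assumption of this sketch.
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(floor(x)), int(floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = feat[y0][x0] * (1 - ax) + feat[y0][x1] * ax
    bottom = feat[y1][x0] * (1 - ax) + feat[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def adaptive_sample(feat, box, offsets):
    """Sample one feature per offset around a query box.

    box     -- hypothetical (cx, cy, w, h) query box in feature-map coordinates
    offsets -- hypothetical (dx, dy) pairs in units of the box size; a real
               detector would predict these from the query content
    Offsets with magnitude above 0.5 land outside the box, which is how the
    sampling can reach context that RoI pooling would never see.
    """
    cx, cy, bw, bh = box
    points = [(cx + dx * bw, cy + dy * bh) for dx, dy in offsets]
    values = [bilinear_sample(feat, px, py) for px, py in points]
    return points, values


if __name__ == "__main__":
    # Toy 4x5 feature map whose value encodes its position (row * 10 + col).
    feat = [[r * 10 + c for c in range(5)] for r in range(4)]
    box = (2.0, 1.0, 2.0, 2.0)  # spans x in [1, 3] and y in [0, 2]
    offsets = [(0.0, 0.0), (0.25, 0.5), (1.0, 0.0)]  # last one leaves the box
    print(adaptive_sample(feat, box, offsets))
```

The third offset places a sample at x = 4, outside the box's horizontal extent of [1, 3], which a box-constrained operator such as RoIAlign would not reach. Extending the sketch to the temporal dimension would mean adding a frame index to each sample point.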

URL

https://arxiv.org/abs/2404.09842

PDF

https://arxiv.org/pdf/2404.09842.pdf
