STMixer: A One-Stage Sparse Action Detector

2024-04-15 14:52:02
Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, Limin Wang

Abstract

Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and the feature sampling is constrained inside the box, failing to effectively leverage richer context information outside. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, thus suffering from the issues of inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, that is, STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.
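
As a rough illustration of the first design, the sketch below shows what query-based sampling over the whole spatio-temporal domain could look like. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `bilinear`, `sample_points`, and `adaptive_sample`, the single-channel T x H x W feature layout, and the box-scaled offset parameterization are all hypothetical. In a real detector the offsets would be predicted from each query; here they are passed in directly. The decoupled feature mixing module is not covered.

```python
import math


def bilinear(fmap, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at a continuous (x, y).

    Coordinates are clamped to the grid, so points far outside still return a value.
    """
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = fmap[y0][x0] * (1 - ax) + fmap[y0][x1] * ax
    bottom = fmap[y1][x0] * (1 - ax) + fmap[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def sample_points(box, offsets):
    """Turn a query's box (cx, cy, w, h) and offsets into sampling locations.

    Assumption: offsets are scaled by the box size and are NOT clipped to the box,
    so a point may land in the context region outside the actor box.
    """
    cx, cy, bw, bh = box
    return [(cx + dx * bw, cy + dy * bh) for dx, dy in offsets]


def adaptive_sample(video_feats, box, offsets):
    """Sample the same points on every frame; returns a T x P list of features."""
    points = sample_points(box, offsets)
    return [[bilinear(frame, x, y) for x, y in points] for frame in video_feats]


# Toy feature volume: 2 frames of a 4 x 4 map with value x + 10*y + 100*t.
feats = [[[x + 10 * y + 100 * t for x in range(4)] for y in range(4)] for t in range(2)]
box = (1.5, 1.5, 1.0, 1.0)                      # center (1.5, 1.5), size 1 x 1
offsets = [(0.0, 0.0), (1.0, 0.5), (-0.5, -0.5)]  # the second point falls outside the box
sampled = adaptive_sample(feats, box, offsets)
print(sampled)
```

The point of the sketch is the contrast with RoIAlign-style sampling described in the abstract: the sampling locations depend on per-query offsets rather than on a fixed grid inside the box, so context outside the actor box can contribute features.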

URL

https://arxiv.org/abs/2404.09842

PDF

https://arxiv.org/pdf/2404.09842.pdf

