Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization

2025-07-08 21:28:16
Hayat Ullah, Arslan Munir, Oliver Nina

Abstract

Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, in which each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of PCL-Former. The quantitative results validate the effectiveness of the proposed PCL-Former, which outperforms state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS-14, ActivityNet-1.3, and HACS Segments datasets, respectively.
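
The division of labor described in the abstract can be pictured as a three-stage data flow. The following is only a minimal sketch of that flow, not the authors' implementation: it assumes the three modules run one after another on each candidate segment, which the abstract does not spell out. Every name in it (`run_pipeline`, `Detection`, the `toy_*` functions) is hypothetical, and the stages are simple threshold rules standing in for the transformer modules.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[int, int]  # (start_index, end_index), end exclusive


@dataclass
class Detection:
    label: str
    start: int
    end: int


def run_pipeline(
    video: Sequence[float],
    propose: Callable[[Sequence[float]], List[Segment]],
    classify: Callable[[Sequence[float], Segment], Optional[str]],
    localize: Callable[[Sequence[float], Segment], Segment],
) -> List[Detection]:
    """Chain the three stages: propose, then classify, then refine boundaries."""
    detections = []
    for segment in propose(video):
        label = classify(video, segment)
        if label is None:  # assumed convention: None means background
            continue
        start, end = localize(video, segment)
        detections.append(Detection(label, start, end))
    return detections


# Toy stand-ins for the three modules (NOT transformers). The "video" is a
# list of made-up per-step activity scores.
def toy_propose(video: Sequence[float]) -> List[Segment]:
    # Coarse fixed-length windows whose mean score exceeds a threshold.
    window = 5
    segments = []
    for start in range(0, len(video), window):
        chunk = video[start:start + window]
        if sum(chunk) / len(chunk) > 0.3:
            segments.append((start, start + len(chunk)))
    return segments


def toy_classify(video: Sequence[float], segment: Segment) -> Optional[str]:
    # Single placeholder class; a real classifier would choose among many.
    start, end = segment
    return "action" if max(video[start:end]) >= 0.5 else None


def toy_localize(video: Sequence[float], segment: Segment) -> Segment:
    # Tighten the coarse window to the first and last high-score steps.
    start, end = segment
    active = [i for i in range(start, end) if video[i] >= 0.5]
    return (active[0], active[-1] + 1) if active else segment


video = [0.0, 0.1, 0.9, 0.8, 0.7, 0.1, 0.0, 0.6, 0.9, 0.0]
detections = run_pipeline(video, toy_propose, toy_classify, toy_localize)
for d in detections:
    print(d)
```

Passing the stages in as callables keeps each subtask separate, loosely mirroring the idea of one dedicated module per subtask; the specialized loss functions mentioned in the abstract are not modeled here.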

URL

https://arxiv.org/abs/2507.06411

PDF

https://arxiv.org/pdf/2507.06411.pdf

