Abstract
Inspired by the recent success of transformers and multi-stage architectures in video recognition and object detection, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, in which each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate our method, we conduct extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conduct detailed ablation experiments to assess the impact of each individual module of PCL-Former. The quantitative results validate the effectiveness of the proposed PCL-Former, which outperforms state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS-14, ActivityNet-1.3, and HACS Segments datasets, respectively.
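The division of labor described above (propose, then classify, then localize) can be illustrated with a minimal data-flow sketch. This is not the PCL-Former implementation: the three stages are passed in as plain callables, the toy stand-ins replace the transformer modules with trivial rules, and every name (`Detection`, `run_pipeline`, `toy_propose`, `toy_classify`, `toy_localize`) as well as the per-snippet feature format is a hypothetical chosen for illustration. The order in which classification and localization are applied to each proposal is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical input format: one feature vector per video snippet.
Features = Sequence[Sequence[float]]
# Hypothetical segment format: (start, end) snippet indices, end exclusive.
Segment = Tuple[int, int]


@dataclass
class Detection:
    label: str
    start: int
    end: int


def run_pipeline(
    features: Features,
    propose: Callable[[Features], List[Segment]],
    classify: Callable[[Features, Segment], str],
    localize: Callable[[Features, Segment], Segment],
) -> List[Detection]:
    """Chain three stage functions: propose -> classify -> localize."""
    detections = []
    for segment in propose(features):
        label = classify(features, segment)       # action category of the segment
        start, end = localize(features, segment)  # refined temporal boundaries
        detections.append(Detection(label, start, end))
    return detections


# Toy stand-ins (assumptions, NOT the paper's transformer modules).
def toy_propose(features: Features) -> List[Segment]:
    """Return contiguous runs of snippets whose first feature exceeds 0.5."""
    segments, start = [], None
    for i, vec in enumerate(features):
        active = vec[0] > 0.5
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(features)))
    return segments


def toy_classify(features: Features, segment: Segment) -> str:
    """Assign a single placeholder label to every segment."""
    return "action"


def toy_localize(features: Features, segment: Segment) -> Segment:
    """Leave the proposed boundaries unchanged."""
    return segment


toy_features = [[0.1], [0.9], [0.8], [0.2], [0.7]]
detections = run_pipeline(toy_features, toy_propose, toy_classify, toy_localize)
```

Passing the stages as callables mirrors the modular design the abstract describes: any one stage can be swapped or ablated without touching the other two.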
URL
https://arxiv.org/abs/2507.06411