Abstract
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. Because action boundaries in videos are often unclear, existing methods tend to predict them imprecisely. To resolve this issue, we propose a one-stage framework named TriDet. First, we propose a Trident-head that models each action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e., instant discriminability deterioration) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate it. To further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Last, considering that classification already receives adequate spatial-temporal context, we design a decoupled feature pyramid network with separate feature pyramids, which incorporates rich spatial context from the large model for localization. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets, including hierarchical (multilabel) TAD datasets.
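The boundary modeling described above can be pictured as predicting, for every temporal instant, a categorical distribution over nearby relative offsets and taking its expectation as the distance to the boundary. The sketch below illustrates that idea in PyTorch under stated assumptions: the class and parameter names (`BoundaryDistributionHead`, `num_bins`) are hypothetical, and the single branch shown stands in for the Trident-head's three branches (start, end, and center offset), so this is an illustration of distribution-based boundary estimation rather than the released TriDet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BoundaryDistributionHead(nn.Module):
    """Toy head: for every instant t, predict a distribution over B relative
    offsets (bins) and use its expectation as the boundary offset.
    Hypothetical single-branch sketch, not the paper's full Trident-head."""

    def __init__(self, channels: int, num_bins: int = 16):
        super().__init__()
        # 1D conv over time produces per-instant logits, one per offset bin.
        self.bin_logits = nn.Conv1d(channels, num_bins, kernel_size=3, padding=1)
        # Fixed bin centers: relative offsets 0, 1, ..., B-1 (in instants).
        self.register_buffer("bin_offsets", torch.arange(num_bins, dtype=torch.float32))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, T) temporal feature sequence.
        logits = self.bin_logits(feats)           # (batch, B, T)
        probs = F.softmax(logits, dim=1)          # relative probability over offsets
        # Expected offset per instant: sum_b p_b * offset_b.
        expected = (probs * self.bin_offsets.view(1, -1, 1)).sum(dim=1)
        return expected                            # (batch, T)


# Usage with a hypothetical feature pyramid level:
feats = torch.randn(2, 256, 192)                  # (batch, channels, T)
head = BoundaryDistributionHead(channels=256)
start_offsets = head(feats)                       # predicted distance to the start boundary
```

In a full detector, one such distribution would be predicted for the start and end boundaries of each instant, and the resulting offsets combined with the instant's position to form candidate action segments.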
URL
https://arxiv.org/abs/2309.05590