Abstract
In weakly-supervised temporal action localization, previous works have failed to locate dense and integral regions for each entire action because they overestimate the most salient regions. To alleviate this issue, we propose a marginalized average attentional network (MAAN) that suppresses the dominant response of the most salient regions in a principled manner. MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples multiple subsets from the video snippet features according to these latent discriminative probabilities and takes the expectation over all the averaged subset features. Theoretically, we prove that the MAA module with learned latent discriminative probabilities reduces the difference in responses between the most salient regions and the others. Therefore, MAAN is able to generate better class activation sequences and identify dense and integral action regions in the videos. Moreover, we propose a fast algorithm that reduces the complexity of constructing MAA from $O(2^T)$ to $O(T^2)$. Extensive experiments on two large-scale video datasets show that MAAN achieves superior performance on weakly-supervised temporal action localization.
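The aggregation described above can be illustrated with a small sketch. It assumes that each of the $T$ snippets is kept independently with its own probability $p_t$, that a sampled subset is summarized by the plain average of its features, and that the empty subset contributes a zero vector; these conventions, the function names `maa_brute_force` and `maa_fast`, and the dynamic program itself are illustrative assumptions and not necessarily the recurrence used in the paper. The first function enumerates all $2^T$ subsets; the second computes the same expectation in $O(T^2)$ steps by tracking $g(c) = E[N / (c + D)]$ and $r(c) = E[1 / (c + D)]$, where $N$ is the sum of kept features and $D$ the number of kept snippets so far.

```python
import itertools

import numpy as np


def maa_brute_force(x, p):
    """Expectation of the averaged subset feature by enumerating all 2^T subsets.

    x: (T, d) snippet features; p: (T,) keep-probabilities.
    Assumption: the empty subset contributes a zero vector.
    """
    T, d = x.shape
    out = np.zeros(d)
    for bits in itertools.product([0, 1], repeat=T):
        z = np.array(bits, dtype=float)
        prob = np.prod(np.where(z == 1, p, 1.0 - p))  # probability of this subset
        k = z.sum()
        if k > 0:
            out += prob * (z @ x) / k  # average of the kept features
    return out


def maa_fast(x, p):
    """Same expectation via an O(T^2) dynamic program (illustrative, see lead-in).

    g[c] = E[N / (c + D)] and r[c] = E[1 / (c + D)] over the snippets seen so far.
    """
    T, d = x.shape
    g = np.zeros((T + 1, d))          # with no snippets seen, N = 0, so g = 0
    r = np.zeros(T + 1)
    r[1:] = 1.0 / np.arange(1, T + 1)  # with no snippets seen, D = 0; r[0] is never read
    for t in range(T):
        n = T - t  # offsets c = 0..n-1 remain valid after this step
        # Condition on whether snippet t is kept (prob p[t]) or dropped.
        g_new = (1 - p[t]) * g[:n] + p[t] * (g[1:n + 1] + r[1:n + 1, None] * x[t])
        r_new = (1 - p[t]) * r[:n] + p[t] * r[1:n + 1]
        g[:n] = g_new
        r[:n] = r_new
    return g[0]  # offset c = 0 is the plain subset average
```

For small $T$ the two functions can be compared directly, for example `np.allclose(maa_brute_force(x, p), maa_fast(x, p))` on random inputs. When every $p_t$ equals 1, both reduce to ordinary average pooling over all snippets.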
Abstract (translated)
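In weakly-supervised temporal action localization, previous work has failed to locate dense and complete regions for each action because the most salient regions are overestimated. To alleviate this problem, we propose the marginalized average attentional network (MAAN), which suppresses the dominant response of the most salient regions in a principled manner. MAAN adopts a new marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities end to end. MAA draws multiple subsets from the video snippet features according to these latent discriminative probabilities and takes the expectation over all averaged subset features. Theoretically, we prove that the MAA module with learned latent discriminative probabilities reduces the difference in response between the most salient regions and the other regions. As a result, MAAN can generate better class activation sequences and identify dense and complete action regions in videos. In addition, we propose a fast algorithm that reduces the complexity of constructing MAA from $O(2^T)$ to $O(T^2)$. Extensive experiments on two large-scale video datasets show that MAAN achieves superior performance on weakly-supervised temporal action localization.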
URL
https://arxiv.org/abs/1905.08586