Paper Reading AI Learner

Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach

2023-12-21 18:57:12
Qinying Liu, Zilei Wang, Shenghai Rong, Junjie Li, Yixin Zhang

Abstract

Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies each cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at this https URL
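The abstract's optimal-transport self-labeling step can be sketched with a Sinkhorn-Knopp iteration: given snippet-to-cluster scores, it produces soft pseudo-labels whose marginals match chosen prior distributions. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the entropic temperature `eps`, and the choice of marginals are assumptions.

```python
import numpy as np

def sinkhorn_pseudo_labels(scores, row_marginal, col_marginal, eps=0.05, n_iters=50):
    """Hypothetical sketch of OT-based self-labeling via Sinkhorn-Knopp.

    scores:       (N, K) similarity of N snippets to K latent clusters.
    row_marginal: (N,) prior mass per snippet (typically uniform).
    col_marginal: (K,) prior mass per cluster (e.g. an assumed F&B ratio).
    Returns a (N, K) matrix of soft pseudo-labels, one distribution per snippet.
    """
    Q = np.exp(scores / eps)  # entropic kernel of the transport problem
    Q /= Q.sum()
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the prior marginals.
        Q *= (row_marginal / Q.sum(axis=1))[:, None]
        Q *= (col_marginal / Q.sum(axis=0))[None, :]
    # Normalize each row so every snippet gets a valid label distribution.
    return Q / Q.sum(axis=1, keepdims=True)

# Toy usage: 4 snippets, 2 clusters, uniform snippet prior, 50/50 cluster prior.
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 2))
P = sinkhorn_pseudo_labels(S, np.full(4, 0.25), np.full(2, 0.5))
```

Enforcing the column marginal is what prevents the degenerate solution where all snippets collapse into one cluster, which is the role the abstract attributes to matching "plausible prior distributions".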


URL

https://arxiv.org/abs/2312.14138

PDF

https://arxiv.org/pdf/2312.14138.pdf

