Abstract
Recently, temporal action localization (TAL) has garnered significant interest in the information retrieval community. However, existing supervised and weakly supervised methods depend heavily on extensive labeled temporal boundaries and action categories, which are labor-intensive and time-consuming to annotate. Although some unsupervised methods have adopted the ``iteratively clustering and localization'' paradigm for TAL, they still suffer from two pivotal impediments: 1) unsatisfactory video clustering confidence, and 2) unreliable video pseudo-labels for model training. To address these limitations, we present a novel self-paced incremental learning model that enhances clustering and localization training simultaneously, thereby facilitating more effective unsupervised TAL. Concretely, we improve clustering confidence by exploring contextual, feature-robust visual information. We then design two incremental instance learning strategies (constant-speed and variable-speed) for easy-to-hard model training, ensuring the reliability of the video pseudo-labels and further improving overall localization performance. Extensive experiments on two public datasets substantiate the superiority of our model over several state-of-the-art competitors.
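To make the easy-to-hard training idea concrete, below is a minimal Python sketch of self-paced instance selection over confidence-ranked pseudo-labeled videos. All function names and the polynomial "variable-speed" schedule are illustrative assumptions, not the paper's actual formulation: each epoch admits a growing fraction of the highest-confidence instances, with the fraction ramping linearly (constant speed) or polynomially (variable speed).

```python
import numpy as np

def selection_ratio(epoch: int, total_epochs: int,
                    mode: str = "constant", power: float = 2.0) -> float:
    """Fraction of pseudo-labeled instances admitted at a given epoch.

    'constant' grows the ratio linearly (same increment every epoch);
    'variable' grows it polynomially (cautious early, aggressive late).
    """
    t = (epoch + 1) / total_epochs
    if mode == "constant":
        return t            # linear ramp
    return t ** power       # variable-speed ramp (assumed polynomial form)

def select_easy_instances(confidences: np.ndarray, epoch: int,
                          total_epochs: int, mode: str = "constant") -> np.ndarray:
    """Return indices of the highest-confidence (easiest) instances to train on."""
    k = max(1, int(selection_ratio(epoch, total_epochs, mode) * len(confidences)))
    return np.argsort(-confidences)[:k]   # sort by descending confidence, keep top-k

# Example: 10 videos with clustering-confidence scores, 5-epoch schedule.
conf = np.array([0.9, 0.2, 0.7, 0.95, 0.4, 0.6, 0.8, 0.3, 0.55, 0.85])
for ep in range(5):
    idx = select_easy_instances(conf, ep, total_epochs=5, mode="variable")
    print(f"epoch {ep}: train on instances {sorted(idx.tolist())}")
```

A slow-start variable-speed ramp keeps early training on the most reliable pseudo-labels, which is one plausible way to realize the reliability goal stated in the abstract.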
URL
https://arxiv.org/abs/2312.07384