Abstract
Temporal action segmentation is typically achieved by detecting sharp changes in global visual descriptors. In this paper, we explore the merits of local features by proposing an unsupervised framework, Object-centric Temporal Action Segmentation (OTAS). OTAS consists of self-supervised global and local feature extraction modules and a boundary selection module that fuses the features and detects salient boundaries for action segmentation. As a second contribution, we discuss the pros and cons of existing frame-level and boundary-level evaluation metrics. Through extensive experiments, we find that OTAS outperforms the previous state-of-the-art method by $41\%$ on average in terms of our recommended F1 score. Surprisingly, OTAS even outperforms the ground-truth human annotations in the user study. Moreover, OTAS is efficient enough to allow real-time inference.
URL
https://arxiv.org/abs/2309.06276