Abstract
Existing supervised action segmentation methods depend on the quality of frame-wise classification, which uses attention mechanisms or temporal convolutions to capture temporal dependencies. Even boundary detection-based methods depend primarily on the accuracy of an initial frame-wise classification, and can therefore fail to identify segments and boundaries precisely when that prediction is of low quality. To address this problem, this paper proposes ASESM (Action Segmentation via Explicit Similarity Measurement), which improves segmentation accuracy by incorporating explicit similarity evaluation across frames and predictions. Our supervised learning architecture feeds frame-level multi-resolution features into multiple Transformer encoders. The resulting frame-wise predictions are combined through similarity voting to obtain a high-quality initial prediction. We then apply a newly proposed boundary correction algorithm that uses feature similarity between consecutive frames to adjust the boundary locations iteratively throughout the learning process. The corrected prediction is further refined through multiple stages of temporal convolutions. As post-processing, we optionally apply boundary correction again, followed by a segment smoothing method that removes outlier classes within segments using similarity measurement between consecutive predictions. Additionally, we propose a fully unsupervised boundary detection-correction algorithm that identifies segment boundaries based solely on feature similarity, without any training. Experiments on the 50Salads, GTEA, and Breakfast datasets show the effectiveness of both the supervised and unsupervised algorithms. Code and models are made available on GitHub.
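The idea of locating segment boundaries from feature similarity between consecutive frames can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the similarity measure, the decision rule, or the iterative correction step. The sketch assumes cosine similarity and a fixed threshold, and the function names and the `threshold` parameter are hypothetical.

```python
import numpy as np


def consecutive_cosine_similarity(features):
    """Cosine similarity between each frame and the next one.

    features: array of shape (T, D), one feature vector per frame.
    Returns an array of shape (T - 1,).
    """
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1)
    # Guard against all-zero feature vectors (assumption: treat them as unit norm).
    norms = np.where(norms == 0.0, 1.0, norms)
    unit = features / norms[:, None]
    return np.sum(unit[:-1] * unit[1:], axis=1)


def detect_boundaries(features, threshold=0.5):
    """Return frame indices at which a new segment is assumed to start.

    A boundary is placed at frame t when the similarity between frames
    t - 1 and t drops below `threshold` (an illustrative choice only).
    """
    sims = consecutive_cosine_similarity(features)
    return [int(t) + 1 for t in np.flatnonzero(sims < threshold)]


if __name__ == "__main__":
    # Four toy frames: two pointing along the first axis, two along the second.
    toy = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
    print(detect_boundaries(toy))  # prints [2]
```

A fixed threshold is the simplest possible rule; in practice it would need tuning per dataset and feature extractor, and a training-free method such as the one described would likely use a more robust criterion.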
URL
https://arxiv.org/abs/2502.10713