Paper Reading AI Learner

Improving action segmentation via explicit similarity measurement

2025-02-15 08:02:38
Kamel Aouaidjia, Wenhao Zhang, Aofan Li, Chongsheng Zhang

Abstract

Existing supervised action segmentation methods depend on the quality of frame-wise classification, using attention mechanisms or temporal convolutions to capture temporal dependencies. Even boundary detection-based methods rely primarily on the accuracy of an initial frame-wise classification, and can therefore fail to precisely identify segments and boundaries when that initial prediction is of low quality. To address this problem, this paper proposes ASESM (Action Segmentation via Explicit Similarity Measurement), which improves segmentation accuracy by incorporating explicit similarity evaluation across frames and predictions. Our supervised learning architecture feeds frame-level multi-resolution features into multiple Transformer encoders. The resulting frame-wise predictions are combined by similarity voting to obtain a high-quality initial prediction. A newly proposed boundary correction algorithm, driven by feature similarity between consecutive frames, then adjusts boundary locations iteratively during learning. The corrected prediction is further refined through multiple stages of temporal convolutions. As post-processing, we optionally apply boundary correction again, followed by a segment smoothing method that removes outlier classes within segments using similarity measurement between consecutive predictions. Additionally, we propose a fully unsupervised boundary detection-correction algorithm that identifies segment boundaries based solely on feature similarity, without any training. Experiments on the 50Salads, GTEA, and Breakfast datasets demonstrate the effectiveness of both the supervised and unsupervised algorithms. Code and models are available on GitHub.
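
The abstract gives no implementation details, so the following is a minimal illustrative sketch of two of the ideas it describes, not the authors' ASESM code: per-frame similarity voting over multiple frame-wise predictions, and detecting segment boundaries where feature similarity between consecutive frames dips. The function names, the choice of cosine similarity, and the threshold value are all assumptions made for illustration; the paper's boundary correction is additionally iterative and coupled to the training process.

```python
import numpy as np

def detect_boundaries(features: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Mark a boundary wherever cosine similarity between consecutive
    frame features drops below `threshold` (an assumed, untuned value).

    features: (T, D) array of per-frame features.
    Returns the indices of frames that start a new segment.
    """
    # Normalize each frame feature to unit length for cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)
    # Similarity between frame t and frame t+1, for t = 0..T-2.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    # A similarity dip suggests a segment boundary between t and t+1.
    return np.where(sims < threshold)[0] + 1

def similarity_vote(predictions) -> np.ndarray:
    """Per-frame majority vote over multiple frame-wise predictions.

    predictions: list of (T,) integer class-label arrays, one per encoder.
    Returns a single (T,) voted label array.
    """
    stacked = np.stack(predictions)  # shape (K, T)
    voted = np.empty(stacked.shape[1], dtype=stacked.dtype)
    for t in range(stacked.shape[1]):
        labels, counts = np.unique(stacked[:, t], return_counts=True)
        voted[t] = labels[np.argmax(counts)]
    return voted
```

In the pipeline the abstract describes, the voted prediction only serves as the initialization; it is subsequently passed through boundary correction and multiple stages of temporal convolutions rather than used directly.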


URL

https://arxiv.org/abs/2502.10713

PDF

https://arxiv.org/pdf/2502.10713.pdf

