Paper Reading AI Learner

Multi-Stage Boundary-Aware Transformer Network for Action Segmentation in Untrimmed Surgical Videos

2025-04-26 01:07:56
Rezowan Shuvo, M S Mekala, Eyad Elyan

Abstract

Understanding actions within surgical workflows is essential for evaluating post-operative outcomes. However, capturing long sequences of actions in surgical settings is challenging: individual surgeons have their own approaches, shaped by their expertise, which leads to significant variability. To tackle this problem, we focus on segmentation with precise boundaries, a demanding task because action durations vary and transitions in untrimmed videos are often subtle. These transitions, marked by ambiguous start and end points, complicate the segmentation process. Traditional models such as MS-TCN, which depend on large receptive fields, frequently suffer from over-segmentation (fragmented segments) or under-segmentation (distinct actions merged together), and both degrade segmentation quality. To overcome these challenges, we present the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention, designed to enhance action segmentation. Our approach incorporates a novel unified loss function that treats action classification and boundary detection as distinct yet interdependent tasks. Unlike traditional binary boundary detection methods, our boundary voting mechanism accurately identifies start and end points by leveraging contextual information. Extensive experiments on three challenging surgical datasets show that the proposed method achieves state-of-the-art F1 scores at thresholds of 25% and 50%, while delivering comparable performance on other metrics.
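
The abstract reports F1 scores at thresholds of 25% and 50% and names over-segmentation as a key failure mode. The sketch below shows one common formulation of a segmental F1 score at an IoU threshold, as widely used in the action segmentation literature. It is an illustration under stated assumptions and not the paper's evaluation code: the function names, the greedy one-to-one matching rule, and the toy label sequences are all hypothetical.

```python
from typing import List, Tuple

# (label, start_frame, end_frame_exclusive); a hypothetical representation
Segment = Tuple[int, int, int]


def frames_to_segments(labels: List[int]) -> List[Segment]:
    """Collapse a frame-wise label sequence into runs of identical labels."""
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def segmental_f1(pred: List[int], truth: List[int], iou_threshold: float) -> float:
    """Segmental F1: a predicted segment is a true positive if it overlaps an
    unmatched ground-truth segment of the same label with IoU >= threshold."""
    pred_segs = frames_to_segments(pred)
    true_segs = frames_to_segments(truth)
    matched = [False] * len(true_segs)
    tp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_idx = 0.0, -1
        for idx, (tl, ts, te) in enumerate(true_segs):
            if tl != label or matched[idx]:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            tp += 1
            matched[best_idx] = True
    fp = len(pred_segs) - tp   # predicted segments with no match
    fn = len(true_segs) - tp   # ground-truth segments never matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (made-up data): two true actions, one spurious fragment predicted.
truth = [0] * 10 + [1] * 10
pred = [0] * 4 + [1] * 2 + [0] * 4 + [1] * 10

frame_accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
print("frame accuracy:", frame_accuracy)
print("F1@25:", segmental_f1(pred, truth, 0.25))
print("F1@50:", segmental_f1(pred, truth, 0.50))
```

In this toy case a two-frame spurious fragment leaves frame-wise accuracy at 90%, yet F1@50 drops to about 0.33 because the first action is split into pieces that each overlap the ground truth too little. This is why segment-level metrics are sensitive to over-segmentation in a way that frame-wise accuracy is not.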

URL

https://arxiv.org/abs/2504.18756

PDF

https://arxiv.org/pdf/2504.18756.pdf

