Abstract
Purpose: Surgical video is an important data stream for gesture recognition, so robust visual encoders for this data are similarly important. Methods: Leveraging the Bridge-Prompt framework, we fine-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This allows the use of extensive external data, including text, as well as label metadata and weakly supervised contrastive losses. Results: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures and tasks that were not provided during encoder training are included at prediction time. Additionally, we measure the benefit of including text descriptions in the feature-extractor training scheme. Conclusion: Bridge-Prompt and similar pre-trained and fine-tuned video encoder models provide strong visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks and gestures, the ability of these models to transfer zero-shot, without any task- or gesture-specific retraining, makes them invaluable.
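The weakly supervised contrastive training the abstract refers to builds on the CLIP-style symmetric contrastive objective, which pulls matched video-clip and text-prompt embeddings together and pushes mismatched pairs apart. The sketch below is illustrative only: the function name, embedding shapes, and temperature value are assumptions, not details from the paper.

```python
import numpy as np

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss.

    video_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so similarities are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # (N, N); matched pairs on the diagonal
    idx = np.arange(len(v))

    def cross_entropy(l):
        # softmax over each row, then negative log-likelihood of the diagonal
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()

    # average the video-to-text and text-to-video directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A prompt-based setup like Bridge-Prompt would feed text embeddings of gesture-label prompts into such a loss, so unseen gestures can later be scored zero-shot by comparing clip embeddings against new prompt embeddings.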
URL
https://arxiv.org/abs/2403.19786