Abstract
Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, which are costly to build, hard to scale, and dependent on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at this https URL.
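To make the two components concrete, the following Python sketch shows one way they could fit together, assuming frozen CLIP-style encoders that embed images and text into a shared space. Everything here is illustrative: the names (fit_alignment, TextDecoder), the embedding dimension, the offset-based alignment, and the MLP decoder are simplifying assumptions, not the paper's actual implementation.

```python
# Minimal sketch of Surg-FTDA's two components (assumptions, not the
# authors' code): (1) few-shot modality alignment, (2) a decoder trained
# on text embeddings only, then applied to aligned image embeddings.
import torch
import torch.nn as nn

d = 512  # assumed embedding dimension of a CLIP-style encoder

# --- Component 1: few-shot selection-based modality alignment ------------
def fit_alignment(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Estimate a constant offset closing the modality gap.

    One simple choice: the mean difference between the few paired text and
    image embeddings; the paper's alignment may be more sophisticated.
    """
    return txt_emb.mean(dim=0) - img_emb.mean(dim=0)

# --- Component 2: text-driven adaptation ----------------------------------
class TextDecoder(nn.Module):
    """Small decoder mapping an embedding to task labels (hypothetical)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_classes))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)

# Stand-in random embeddings; in practice these would come from the frozen
# image/text encoders applied to the few selected images and task texts.
few_shot_imgs = torch.randn(16, d)   # few-shot image embeddings
few_shot_txts = torch.randn(16, d)   # their corresponding text embeddings
offset = fit_alignment(few_shot_imgs, few_shot_txts)

# Train the decoder on text embeddings and labels alone (no images).
decoder = TextDecoder(d, num_classes=7)          # e.g., 7 surgical phases
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
txt_train = torch.randn(256, d)                  # text embeddings
y_train = torch.randint(0, 7, (256,))            # their task labels
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(decoder(txt_train), y_train)
    loss.backward()
    opt.step()

# Inference: shift new image embeddings toward the text space, then decode,
# so no paired image-text data is ever needed to train the decoder.
new_img_emb = torch.randn(4, d)
logits = decoder(new_img_emb + offset)
print(logits.argmax(dim=-1))                     # predicted phases
```

The key design point the sketch tries to capture is that the decoder only ever sees text-side embeddings during training; alignment is what lets image embeddings be dropped into the same decoder at inference time.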
URL
https://arxiv.org/abs/2501.09555