
Self-Supervised Learning for Interventional Image Analytics: Towards Robust Device Trackers

2024-05-02 10:18:22
Saahil Islam, Venkatesh N. Murthy, Dominik Neumann, Badhan Kumar Das, Puneet Sharma, Andreas Maier, Dorin Comaniciu, Florin C. Ghesu

Abstract

Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness, i.e., no failures during tracking. To achieve that, one needs to efficiently tackle challenges such as device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, as well as the continuous movement due to cardiac and respiratory motion. To overcome these challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance and, in particular, robustness compared to highly optimized reference solutions (that use multi-stage feature fusion, multi-task learning, and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), and a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
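
As a rough illustration of the pretraining idea described in the abstract, the sketch below implements a masked frame-interpolation objective: patches of a middle X-ray frame are hidden and reconstructed from the neighboring frames, which forces the encoder to learn inter-frame temporal correspondences. The ViT-style encoder, the 3-frame clip layout, the patch size, and the pixel-level reconstruction target are illustrative assumptions for this sketch and are not the architecture or loss actually used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedFrameInterpolation(nn.Module):
    """Reconstruct masked patches of the middle frame from a 3-frame clip (sketch)."""

    def __init__(self, img_size=256, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        self.patch = patch
        self.n_tokens = (img_size // patch) ** 2               # patch tokens per frame
        self.embed = nn.Conv2d(1, dim, patch, stride=patch)    # patchify grayscale frames
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, 3 * self.n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch * patch)              # predict raw pixels per patch

    def forward(self, prev_f, mid_f, next_f, mask):
        # mask: (B, n_tokens) bool, True where a middle-frame patch is hidden
        toks = [self.embed(f).flatten(2).transpose(1, 2) for f in (prev_f, mid_f, next_f)]
        mid = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(toks[1]), toks[1])
        x = torch.cat([toks[0], mid, toks[2]], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, self.n_tokens:2 * self.n_tokens])   # middle-frame tokens only


def pretrain_loss(model, prev_f, mid_f, next_f, mask_ratio=0.75):
    """MSE computed on the masked middle-frame patches only."""
    mask = torch.rand(mid_f.shape[0], model.n_tokens, device=mid_f.device) < mask_ratio
    pred = model(prev_f, mid_f, next_f, mask)
    # ground-truth middle-frame patches, flattened to (B, n_tokens, patch*patch)
    target = F.unfold(mid_f, model.patch, stride=model.patch).transpose(1, 2)
    return ((pred - target) ** 2)[mask].mean()


# Toy usage on random tensors; real pretraining would stream short X-ray clips.
model = MaskedFrameInterpolation()
frames = [torch.randn(2, 1, 256, 256) for _ in range(3)]
loss = pretrain_loss(model, *frames)
loss.backward()
```

Masking only the middle frame (rather than all frames) is what makes the reconstruction an interpolation problem: the encoder can only succeed by relating content across the temporal neighbors, which is the kind of correspondence a downstream device tracker needs.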

URL

https://arxiv.org/abs/2405.01156

PDF

https://arxiv.org/pdf/2405.01156.pdf

