Inflated 3D Convolution-Transformer for Weakly-supervised Carotid Stenosis Grading with Ultrasound Videos

2023-06-05 02:50:06
Xinrui Zhou, Yuhao Huang, Wufeng Xue, Xin Yang, Yuxin Zou, Qilong Ying, Yuanji Zhang, Jia Liu, Jie Ren, Dong Ni

Abstract

Localizing the narrowest position of the vessel and delineating the corresponding vessel and remnant vessel in carotid ultrasound (US) are essential for carotid stenosis grading (CSG) in clinical practice. However, this pipeline is time-consuming and difficult because plaque boundaries are ambiguous and the vessel appearance varies over time. Automating the procedure usually requires a large number of manual delineations, which are laborious to produce and unreliable given the annotation difficulty. In this study, we present the first video classification framework for automatic CSG. Our contribution is three-fold. First, to avoid the need for laborious and unreliable annotation, we propose a novel and effective video classification network for weakly-supervised CSG. Second, to ease model training, we adopt an inflation strategy for the network, in which pre-trained 2D convolution weights are adapted into their 3D counterparts. In this way, an existing large pre-trained model can serve as an effective warm start for our network. Third, to enhance the feature discrimination of the video, we propose a novel attention-guided multi-dimension fusion (AMDF) transformer encoder that models and integrates global dependencies within and across the spatial and temporal dimensions, for which we design two lightweight cross-dimensional attention mechanisms. Our approach is extensively validated on a large, clinically collected carotid US video dataset and demonstrates state-of-the-art performance compared with strong competitors.
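
The inflation strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the exact scheme, so the code below assumes the common I3D-style approach: a 2D kernel is repeated along a new temporal axis and divided by the temporal depth. The function name `inflate_conv2d_weight`, the tensor layout, and the kernel sizes are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def inflate_conv2d_weight(w2d: np.ndarray, time_depth: int) -> np.ndarray:
    """Inflate a 2D conv kernel (out, in, kh, kw) to 3D (out, in, t, kh, kw).

    Assumed I3D-style scheme: copy the 2D kernel along the new temporal
    axis and rescale by 1/t so the summed response is preserved.
    """
    if w2d.ndim != 4:
        raise ValueError("expected a 4D kernel shaped (out, in, kh, kw)")
    if time_depth < 1:
        raise ValueError("time_depth must be a positive integer")
    # Insert a temporal axis at position 2, then repeat the kernel t times.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_depth, axis=2)
    # Rescale so that summing over the temporal axis recovers the 2D kernel.
    return w3d / time_depth


# Hypothetical example: 8 output channels, 3 input channels, 3x3 kernel.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 3, 3, 3))
w3d = inflate_conv2d_weight(w2d, time_depth=5)
print(w3d.shape)
```

Under this assumption, a video whose frames are all identical produces the same response from the inflated 3D filter as the original 2D filter produces on a single frame (away from temporal padding boundaries), which is why the pre-trained 2D weights can act as a warm start.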

URL

https://arxiv.org/abs/2306.02548

PDF

https://arxiv.org/pdf/2306.02548.pdf

