Inflated 3D Convolution-Transformer for Weakly-supervised Carotid Stenosis Grading with Ultrasound Videos

Abstract

Localizing the narrowest position of the vessel and delineating the corresponding vessel and remnant vessel in carotid ultrasound (US) are essential for carotid stenosis grading (CSG) in clinical practice. However, this pipeline is time-consuming and difficult because of ambiguous plaque boundaries and temporal variation. Automating the procedure usually requires a large number of manual delineations, which are both laborious to produce and unreliable given the annotation difficulty. In this study, we present the first video classification framework for automatic CSG. Our contribution is three-fold. First, to avoid the need for laborious and unreliable annotation, we propose a novel and effective video classification network for weakly-supervised CSG. Second, to ease model training, we adopt an inflation strategy for the network, in which pre-trained 2D convolution weights are adapted into their 3D counterparts. In this way, an existing large pre-trained model can serve as an effective warm start for our network. Third, to enhance the feature discrimination of the video, we propose a novel attention-guided multi-dimension fusion (AMDF) transformer encoder that models and integrates global dependencies within and across the spatial and temporal dimensions, using two lightweight cross-dimensional attention mechanisms. Our approach is extensively validated on a large clinically collected carotid US video dataset, demonstrating state-of-the-art performance compared with strong competitors.
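
The inflation strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact inflation rule, so the code below assumes one commonly used rule: each 2D kernel is repeated along a new temporal axis and divided by the temporal depth, so that a video made of identical frames produces the same response as the original 2D filter on a single frame. The function name `inflate_conv2d_kernel`, the nested-list weight layout, and the example values are all hypothetical and chosen only for illustration; a real implementation would likely operate on framework tensors instead.

```python
def inflate_conv2d_kernel(kernel_2d, time_depth):
    """Inflate a 2D conv kernel into a 3D one (assumed repeat-and-rescale rule).

    kernel_2d is a nested list shaped [out][in][kh][kw].
    The result is a nested list shaped [out][in][time_depth][kh][kw].
    """
    if time_depth < 1:
        raise ValueError("time_depth must be >= 1")
    scale = 1.0 / time_depth  # keeps the summed response equal to the 2D kernel
    return [
        [
            # Repeat the rescaled 2D slice once per temporal position.
            [[[w * scale for w in row] for row in in_ch] for _ in range(time_depth)]
            for in_ch in out_ch
        ]
        for out_ch in kernel_2d
    ]


def sum_over_time(kernel_3d_slices):
    """Sum a [t][kh][kw] stack over t, returning a [kh][kw] nested list."""
    height = len(kernel_3d_slices[0])
    width = len(kernel_3d_slices[0][0])
    return [
        [sum(frame[i][j] for frame in kernel_3d_slices) for j in range(width)]
        for i in range(height)
    ]


# Hypothetical pre-trained weights: 1 output channel, 1 input channel, 2x2 kernel.
kernel_2d = [[[[1.0, 2.0], [3.0, 4.0]]]]
kernel_3d = inflate_conv2d_kernel(kernel_2d, time_depth=4)

# Summing the inflated kernel over time recovers the original 2D kernel.
recovered = sum_over_time(kernel_3d[0][0])
```

Under this assumed rule, the 3D network starts from weights that behave like the 2D model on static input, which is one way a pre-trained 2D model can act as a warm start.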

URL

https://arxiv.org/abs/2306.02548

PDF

https://arxiv.org/pdf/2306.02548.pdf