Paper Reading AI Learner

Spatiotemporal Learning with Context-aware Video Tubelets for Ultrasound Video Analysis

2025-03-21 18:39:42
Gary Y. Li, Li Chen, Bryson Hicks, Nikolai Schnittke, David O. Kessler, Jeffrey Shupp, Maria Parker, Cristiana Baloescu, Christopher Moore, Cynthia Gregory, Kenton Gregory, Balasundar Raju, Jochen Kruecker, Alvin Chen

Abstract

Computer-aided pathology detection algorithms for video-based imaging modalities must accurately interpret complex spatiotemporal information by integrating findings across multiple frames. Current state-of-the-art methods operate by classifying video sub-volumes (tubelets), but they often lose global spatial context by focusing only on local regions within detection ROIs. Here we propose a lightweight framework for tubelet-based object detection and video classification that preserves both global spatial context and fine spatiotemporal features. To address the loss of global context, we embed tubelet location, size, and confidence as inputs to the classifier. Additionally, we use ROI-aligned feature maps from a pre-trained detection model, leveraging learned feature representations to increase the receptive field and reduce computational complexity. Our method is efficient, with the spatiotemporal tubelet classifier comprising only 0.4M parameters. We apply our approach to detect and classify lung consolidation and pleural effusion in ultrasound videos. Five-fold cross-validation on 14,804 videos from 828 patients shows our method outperforms previous tubelet-based approaches and is suited for real-time workflows.
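The core idea, combining ROI-aligned detector features with an embedding of each tubelet's box geometry and confidence before a small temporal classifier, lends itself to a compact sketch. Below is a minimal, illustrative PyTorch implementation, not the authors' code: the `TubeletClassifier` name, the layer choices (1x1 convolution, GRU, 64-dim embeddings), and all sizes are assumptions for illustration, since the abstract does not specify architectural details.

```python
# Minimal sketch (assumptions, not the paper's implementation) of a
# context-aware tubelet classifier: per-frame ROI-aligned feature maps
# from a frozen detector are pooled to compact vectors, concatenated with
# an embedding of each box's normalized location, size, and detection
# confidence, and classified with a small temporal model.

import torch
import torch.nn as nn


class TubeletClassifier(nn.Module):
    def __init__(self, feat_channels=256, embed_dim=64, num_classes=2):
        super().__init__()
        # Reduce each per-frame ROI-aligned feature map to a compact vector.
        self.reduce = nn.Sequential(
            nn.Conv2d(feat_channels, embed_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        # Embed per-frame box context: normalized (cx, cy, w, h, confidence).
        self.box_embed = nn.Sequential(
            nn.Linear(5, embed_dim),
            nn.ReLU(inplace=True),
        )
        # Lightweight temporal modelling over the tubelet (GRU is an
        # illustrative choice here).
        self.temporal = nn.GRU(2 * embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, roi_feats, boxes):
        # roi_feats: (B, T, C, S, S) ROI-aligned maps from the detector
        # boxes:     (B, T, 5) normalized (cx, cy, w, h, conf) per frame
        B, T = roi_feats.shape[:2]
        f = self.reduce(roi_feats.flatten(0, 1)).flatten(1)  # (B*T, embed_dim)
        f = f.view(B, T, -1)
        g = self.box_embed(boxes)                            # (B, T, embed_dim)
        x = torch.cat([f, g], dim=-1)                        # (B, T, 2*embed_dim)
        _, h = self.temporal(x)                              # h: (1, B, embed_dim)
        return self.head(h.squeeze(0))                       # (B, num_classes)


# Usage: classify a batch of two 16-frame tubelets with random inputs.
model = TubeletClassifier()
logits = model(torch.randn(2, 16, 256, 7, 7), torch.rand(2, 16, 5))
print(logits.shape)  # torch.Size([2, 2])
```

Note how the box embedding restores the global spatial context (where the ROI sits in the frame and how confident the detector was) that would otherwise be lost when classifying only cropped ROI contents; this sketch stays well under 0.4M parameters, consistent with the lightweight design the abstract describes.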

URL

https://arxiv.org/abs/2503.17475

PDF

https://arxiv.org/pdf/2503.17475.pdf

