Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

2025-06-13 13:36:33
Divyanshu Mishra, Mohammadreza Salehi, Pramit Saha, Olga Patey, Aris T. Papageorghiou, Yuki M. Asano, J. Alison Noble

Abstract

Self-supervised learning (SSL) has achieved major advances in natural image and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches, such as contrastive, masked-modeling, and clustering-based methods, struggle with high inter-sample similarity, are sensitive to the low-PSNR inputs common in ultrasound, or rely on aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual-branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. The two branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, and achieves superior segmentation transfer.
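
The abstract does not give the form of the semantic cluster distillation loss. As a rough illustration only, one common way to distill cluster assignments between two branches is a cross-entropy between their softmax distributions over a shared set of prototypes, with the teacher side treated as a fixed target. The sketch below assumes that form; the function names, the temperatures, and the use of per-frame prototype scores are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax over a list of scores."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_distillation_loss(image_logits, video_logits,
                              teacher_temp=0.05, student_temp=0.1):
    """Mean cross-entropy from image-branch to video-branch cluster assignments.

    ASSUMPTION: this is a generic cluster-distillation form, not the
    paper's exact loss. Each argument is a list of per-frame score lists,
    one score per cluster prototype. Temperatures are illustrative.
    """
    if not image_logits or len(image_logits) != len(video_logits):
        raise ValueError("need equal, non-zero numbers of frames per branch")
    total = 0.0
    for img, vid in zip(image_logits, video_logits):
        # Image branch provides the target distribution (no gradient in practice).
        target = softmax(img, teacher_temp)
        # Video branch is the one being trained to match it.
        pred = softmax(vid, student_temp)
        total += -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))
    return total / len(image_logits)


if __name__ == "__main__":
    agree = cluster_distillation_loss([[2.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]])
    disagree = cluster_distillation_loss([[2.0, 0.0, 0.0]], [[0.0, 0.0, 2.0]])
    print(agree < disagree)
```

Under this assumed form, the loss is small when both branches assign a frame to the same cluster and large when they disagree, which is the behavior a distillation term of this kind is meant to encourage.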

URL

https://arxiv.org/abs/2506.11777

PDF

https://arxiv.org/pdf/2506.11777.pdf

