Paper Reading AI Learner

Training a high-performance retinal foundation model with half-the-data and 400 times less compute

2024-04-30 18:08:08
Justin Engelmann, Miguel O. Bernabeu

Abstract

Artificial Intelligence holds tremendous potential in medicine, but is traditionally limited by the lack of massive datasets to train models on. Foundation models, pre-trained models that can be adapted to downstream tasks with small datasets, could alleviate this problem. Researchers at Moorfields Eye Hospital (MEH) proposed RETFound-MEH, a foundation model for retinal imaging that was trained on 900,000 images, including private hospital data. Recently, the data-efficient DERETFound was proposed, which provides comparable performance while being trained on only 150,000 images that are all publicly available. However, both of these models required very substantial resources to train initially and are resource-intensive in downstream use. We propose a novel Token Reconstruction objective that we use to train RETFound-Green, a retinal foundation model trained using only 75,000 publicly available images and 400 times less compute. We estimate the cost of training RETFound-MEH and DERETFound at $10,000 and $14,000, respectively, while RETFound-Green could be trained for less than $100, with a correspondingly reduced environmental impact. RETFound-Green is also far more efficient in downstream use: it can be downloaded 14 times faster and computes vector embeddings 2.7 times faster, which then require 2.6 times less storage space. Despite this, RETFound-Green does not perform systematically worse. In fact, it performs best on 14 tasks, compared to six for DERETFound and two for RETFound-MEH. Our results suggest that RETFound-Green is a very efficient, high-performance retinal foundation model. We anticipate that our Token Reconstruction objective could be scaled up for even higher performance and be applied to other domains beyond retinal imaging.
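The downstream-use pattern the abstract describes (a frozen foundation model producing vector embeddings that are stored and reused for small downstream tasks) is commonly implemented as linear probing: a lightweight linear classifier is fitted on the precomputed embeddings. A minimal NumPy sketch of that idea, using random vectors as stand-ins for real retinal-image embeddings (the embedding dimension, data, and training setup here are illustrative assumptions, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for stored embeddings from a frozen retinal foundation model.
# The dimension is an illustrative assumption (ViT-style encoders often
# emit 768- or 1024-dimensional embeddings); real embeddings would come
# from running retinal images through the pretrained encoder once.
dim, n_per_class = 64, 200
emb_healthy = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim))
emb_disease = rng.normal(loc=+0.5, scale=1.0, size=(n_per_class, dim))

X = np.vstack([emb_healthy, emb_disease])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Linear probe: logistic regression trained by plain gradient descent.
# The encoder stays frozen; only this small head is learned per task.
w, b, lr = np.zeros(dim), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid of logits
    grad_w = X.T @ (p - y) / len(y)          # gradient of mean log-loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean((X @ w + b > 0) == (y == 1))
print(f"training accuracy: {acc:.2f}")
```

Because the embeddings are computed once and stored, each new downstream task only pays the cost of fitting a head like this, which is why embedding speed and storage size matter for downstream efficiency.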

URL

https://arxiv.org/abs/2405.00117

PDF

https://arxiv.org/pdf/2405.00117.pdf

