Paper Reading AI Learner

Representation Learning of Lab Values via Masked AutoEncoder

2025-01-05 20:26:49
David Restrepo, Chenwei Wu, Yueran Jia, Jaden K. Sun, Jack Gallifant, Catherine G. Bielick, Yugang Jia, Leo A. Celi

Abstract

Accurate imputation of missing laboratory values in electronic health records (EHRs) is critical for enabling robust clinical predictions and reducing biases in healthcare AI systems. Existing methods, such as variational autoencoders (VAEs) and decision-tree-based approaches like XGBoost, struggle to model the complex temporal and contextual dependencies in EHR data, particularly for underrepresented patient groups. In this work, we propose Lab-MAE, a novel transformer-based masked-autoencoder framework that leverages self-supervised learning to impute continuous, sequential lab values. Lab-MAE introduces a structured encoding scheme that jointly models laboratory test values and their corresponding timestamps, enabling temporal dependencies to be captured explicitly. Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE significantly outperforms state-of-the-art baselines such as XGBoost across multiple metrics, including root mean square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves equitable performance across patient demographic groups, advancing fairness in clinical prediction. We further investigate the role of follow-up laboratory values as potential shortcut features, revealing Lab-MAE's robustness in scenarios where such data are unavailable. These findings suggest that our transformer-based architecture, adapted to the characteristics of EHR data, offers a foundation for more accurate and fair clinical imputation models. In addition, we measure and compare the carbon footprint of Lab-MAE and the baseline XGBoost model, highlighting the environmental cost of each approach.
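The abstract describes a structured encoding scheme that jointly represents each lab value and its measurement timestamp, with a fraction of value tokens masked for self-supervised reconstruction. The NumPy sketch below illustrates what such an input pipeline might look like; the random linear value projection, sinusoidal time encoding, zero-vector mask token, and the `encode_lab_sequence` helper are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_lab_sequence(values, hours, d_model=16, mask_ratio=0.3):
    """Hypothetical sketch: build masked input tokens that jointly encode
    lab values and their timestamps (hours since admission)."""
    T = len(values)
    # Value channel: project each scalar lab value into d_model dims
    # (a stand-in for a learned linear projection).
    W = rng.normal(scale=0.1, size=(1, d_model))
    value_emb = np.asarray(values, float)[:, None] @ W          # (T, d_model)
    # Timestamp channel: sinusoidal encoding of the measurement hour,
    # so irregular sampling intervals are represented explicitly.
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000 ** (2 * i / d_model))
    ang = np.asarray(hours, float)[:, None] * freqs[None, :]    # (T, d_model/2)
    time_emb = np.concatenate([np.sin(ang), np.cos(ang)], axis=1)
    tokens = value_emb + time_emb
    # MAE-style masking: hide a fraction of the value information but
    # keep the timestamp, so the model must reconstruct the value from context.
    n_mask = max(1, int(mask_ratio * T))
    masked_idx = rng.choice(T, size=n_mask, replace=False)
    mask_token = np.zeros(d_model)  # stand-in for a learned [MASK] embedding
    tokens[masked_idx] = mask_token + time_emb[masked_idx]
    return tokens, masked_idx

# Example: four creatinine-like measurements at irregular hours.
tokens, masked_idx = encode_lab_sequence([3.5, 4.0, 2.8, 5.1], [0, 6, 12, 24])
```

A transformer encoder would then consume `tokens` and be trained to reconstruct the original values at `masked_idx`.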
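The three reported metrics (RMSE, R2, and Wasserstein distance) follow standard definitions. The sketch below is our own illustration, not the paper's evaluation code; it assumes equal-sized samples and computes the 1-D Wasserstein-1 distance via sorted differences.

```python
import numpy as np

def imputation_metrics(y_true, y_pred):
    """Return (RMSE, R2, 1-D Wasserstein-1 distance) for equal-sized samples.
    Assumes y_true is not constant (otherwise R2 is undefined)."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # For equal-sized empirical distributions, Wasserstein-1 reduces to the
    # mean absolute difference between the sorted samples.
    wd = np.mean(np.abs(np.sort(y_true) - np.sort(y_pred)))
    return rmse, r2, wd
```

Unlike RMSE, which penalizes pointwise errors, the Wasserstein distance compares the overall distributions of true and imputed values, which is useful for detecting imputations that look plausible in aggregate but are biased for subgroups.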


URL

https://arxiv.org/abs/2501.02648

PDF

https://arxiv.org/pdf/2501.02648.pdf

