Abstract
Accurate imputation of missing laboratory values in electronic health records (EHRs) is critical for enabling robust clinical predictions and reducing bias in healthcare AI systems. Existing methods, such as variational autoencoders (VAEs) and decision-tree-based approaches such as XGBoost, struggle to model the complex temporal and contextual dependencies in EHR data, particularly for underrepresented groups. In this work, we propose Lab-MAE, a novel transformer-based masked autoencoder framework that leverages self-supervised learning to impute continuous sequential lab values. Lab-MAE introduces a structured encoding scheme that jointly models laboratory test values and their corresponding timestamps, enabling it to capture temporal dependencies explicitly. Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE significantly outperforms state-of-the-art baselines such as XGBoost across multiple metrics, including root mean square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves equitable performance across patient demographic groups, advancing fairness in clinical predictions. We further investigate the role of follow-up laboratory values as potential shortcut features, revealing Lab-MAE's robustness in scenarios where such data are unavailable. The findings suggest that our transformer-based architecture, adapted to the characteristics of EHR data, offers a foundation model for more accurate and fair clinical imputation. In addition, we measure and compare the carbon footprint of Lab-MAE with that of the baseline XGBoost model, highlighting its environmental requirements.
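As a rough illustration of the three evaluation metrics named above, the sketch below computes RMSE, R2, and a one-dimensional Wasserstein distance between held-out true values and imputed values. It is not the authors' evaluation code: the function names and the toy numbers are hypothetical, and the Wasserstein distance here assumes two equal-sized samples, for which the empirical 1-Wasserstein distance reduces to the mean absolute difference between sorted values.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and imputed values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def wasserstein_1d(y_true, y_pred):
    """Empirical 1-Wasserstein distance for two equal-sized 1-D samples.

    Assumption: both samples have the same length, so the optimal
    transport plan simply pairs the sorted values in order.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("this sketch only handles equal-sized samples")
    a, b = sorted(y_true), sorted(y_pred)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical toy values, purely illustrative (not taken from MIMIC-IV).
held_out = [1.0, 2.0, 3.0, 4.0]
imputed = [1.5, 2.0, 2.5, 4.0]

print(f"RMSE: {rmse(held_out, imputed):.4f}")
print(f"R2:   {r_squared(held_out, imputed):.4f}")
print(f"WD:   {wasserstein_1d(held_out, imputed):.4f}")
```

RMSE and R2 score the imputation point by point, whereas the Wasserstein distance compares the overall distribution of imputed values with that of the true values, so the three metrics can disagree when a model gets the average right but compresses the spread.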
URL
https://arxiv.org/abs/2501.02648