Abstract
We propose RareGraph-Synth, a knowledge-guided, continuous-time diffusion framework that generates realistic yet privacy-preserving synthetic electronic-health-record (EHR) trajectories for ultra-rare diseases. RareGraph-Synth unifies five public resources (Orphanet/Orphadata, the Human Phenotype Ontology (HPO), the GARD rare-disease knowledge graph, PrimeKG, and the FDA Adverse Event Reporting System (FAERS)) into a heterogeneous knowledge graph (KG) comprising approximately 8 million typed edges. Meta-path scores extracted from this KG modulate the per-token noise schedule in the forward stochastic differential equation, steering generation toward biologically plausible lab-medication-adverse-event co-occurrences while retaining the stability of score-based diffusion models. The reverse denoiser then produces timestamped sequences of (lab code, medication code, adverse-event flag) triples that contain no protected health information. On simulated ultra-rare-disease cohorts, RareGraph-Synth lowers categorical Maximum Mean Discrepancy by 40% relative to an unguided diffusion baseline and by more than 60% relative to GAN counterparts, without sacrificing downstream predictive utility. A black-box membership-inference evaluation using the DOMIAS attacker yields an AUROC of approximately 0.53, below the 0.55 safe-release threshold and substantially better than the approximately 0.61 ± 0.03 observed for non-KG baselines, indicating strong resistance to re-identification. These results suggest that integrating biomedical knowledge graphs directly into diffusion noise schedules can simultaneously enhance fidelity and privacy, enabling safer data sharing for rare-disease research.
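The abstract does not give the exact form of the KG-modulated noise schedule, so the following is only a minimal sketch of one way per-token modulation could look, not the authors' method. It assumes a standard variance-preserving SDE with a linear schedule, and it assumes that a token's meta-path score (here a hypothetical value in [0, 1]) shrinks that token's noise rate. The names `perturb_tokens`, `kg_scores`, and `guidance`, as well as the schedule constants, are illustrative assumptions.

```python
import numpy as np


def integrated_beta(t, beta_min=0.1, beta_max=20.0):
    """Integral over [0, t] of the linear schedule beta(s) = beta_min + s * (beta_max - beta_min)."""
    return beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2


def perturb_tokens(x0, kg_scores, t, guidance=0.5, rng=None):
    """Sample x_t from the closed-form VP-SDE marginal with a per-token noise rate.

    x0        : (num_tokens, dim) clean token embeddings
    kg_scores : (num_tokens,) hypothetical meta-path scores in [0, 1]
    t         : diffusion time in [0, 1]
    guidance  : assumed strength of the KG modulation, in [0, 1)

    Assumption: a higher KG score means a more plausible token, which
    receives a smaller effective noise rate. The paper may use a different rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Per-token multiplier on the noise rate, in (1 - guidance, 1].
    scale = 1.0 - guidance * np.clip(kg_scores, 0.0, 1.0)
    # Per-token integrated noise, broadcast over the embedding dimension.
    big_b = scale[:, None] * integrated_beta(t)
    mean_coef = np.exp(-0.5 * big_b)          # shrinkage of the clean signal
    std = np.sqrt(1.0 - np.exp(-big_b))       # standard deviation of added noise
    eps = rng.standard_normal(x0.shape)
    return mean_coef * x0 + std * eps, std


if __name__ == "__main__":
    x0 = np.ones((2, 3))
    scores = np.array([0.0, 1.0])  # token 1 is "more plausible" under the KG
    xt, std = perturb_tokens(x0, scores, t=0.5, rng=np.random.default_rng(0))
    print(std.ravel())  # the second token receives less noise than the first
```

Under these assumptions each token still follows a valid variance-preserving marginal, only with its own effective time scale, which is one way guidance could be added without changing the overall form of the score-based training objective.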
URL
https://arxiv.org/abs/2510.06267