Abstract
Background: The COVID-19 pandemic, caused by SARS-CoV-2, highlighted the critical need to predict disease severity accurately in order to optimize healthcare resource allocation and patient management. The spike protein, which mediates viral entry into host cells, mutates frequently, particularly in the receptor-binding domain, and these mutations influence viral pathogenicity. Artificial intelligence approaches such as deep learning offer a promising way to use genomic and clinical data to predict disease outcomes. Objective: This study aimed to develop a hybrid CNN-LSTM deep learning model to predict COVID-19 severity from spike protein sequences and associated clinical metadata of South American patients. Methods: We retrieved 9,570 spike protein sequences from the GISAID database, of which 3,467 met the inclusion criteria after standardization. The resulting dataset comprised 2,313 severe and 1,154 mild cases. A feature engineering pipeline derived features from the sequences, while demographic and clinical variables were one-hot encoded. We trained a hybrid architecture that combines convolutional (CNN) layers for local pattern extraction with a long short-term memory (LSTM) layer for modeling long-range dependencies. Results: The model achieved an F1 score of 82.92%, a ROC-AUC of 0.9084, a precision of 83.56%, and a recall of 82.85%, demonstrating robust classification performance. Training stabilized at 85% accuracy with minimal overfitting. The most prevalent lineages (P.1, AY.99.2) and clades (GR, GK) aligned with regional epidemiological trends, suggesting potential associations between viral genetics and clinical outcomes. Conclusion: The hybrid CNN-LSTM model effectively predicted COVID-19 severity from spike protein sequences and clinical data, highlighting the utility of AI in genomic surveillance and precision public health. Despite its limitations, this approach provides a framework for early severity prediction in future outbreaks.
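The figures quoted in the abstract can be put in context with a few lines of arithmetic. The sketch below is not part of the study; the function names are hypothetical, and it assumes only the class counts and scores stated above. It computes the class balance, the accuracy of a trivial classifier that always predicts the larger class, and the F1 value implied by treating the reported precision and recall as a single pair.

```python
# Figures quoted in the abstract (nothing else is taken from the paper).
N_SEVERE, N_MILD = 2313, 1154
PRECISION, RECALL, F1_REPORTED = 0.8356, 0.8285, 0.8292


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of one precision/recall pair."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def majority_baseline_accuracy(n_a: int, n_b: int) -> float:
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_a, n_b) / (n_a + n_b)


if __name__ == "__main__":
    total = N_SEVERE + N_MILD
    print(f"total cases: {total}")
    print(f"severe fraction: {N_SEVERE / total:.4f}")
    print(f"majority-class accuracy: {majority_baseline_accuracy(N_SEVERE, N_MILD):.4f}")
    print(f"harmonic mean of reported P and R: {harmonic_f1(PRECISION, RECALL):.4f}")
    print(f"reported F1: {F1_REPORTED:.4f}")
```

Under these assumptions, about two thirds of the cases are severe, so always predicting "severe" would already reach roughly 66.7% accuracy. The harmonic mean of the reported precision and recall is about 0.832, slightly above the reported F1 of 0.8292. A small gap of this kind would be expected if the scores are averaged across the two classes rather than computed for a single positive class, but the abstract does not state which averaging scheme was used, so this remains an assumption.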
Abstract (translated)
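The COVID-19 pandemic caused by SARS-CoV-2 underscored the urgent need to predict disease severity accurately so that medical resources can be allocated and patients managed more effectively. The spike protein, a key component that enables the virus to enter host cells, shows a high mutation rate, especially in the receptor-binding domain, with a marked effect on viral pathogenicity. Artificial intelligence methods such as deep learning offer a potential way to predict disease outcomes from genomic data and clinical information. Objective: This study set out to develop a hybrid CNN-LSTM deep learning model that predicts COVID-19 severity from the spike protein sequences of South American patients and their associated clinical metadata. Methods: We retrieved 9,570 spike protein sequences from the GISAID database, of which 3,467 met the inclusion criteria after standardization. The dataset comprised 2,313 severe and 1,154 mild cases. Sequence features were extracted through a feature engineering pipeline, while demographic and clinical variables were one-hot encoded. A hybrid CNN-LSTM architecture was trained, combining convolutional neural network layers for local pattern extraction with a long short-term memory (LSTM) layer for modeling long-term dependencies. Results: The model achieved an F1 score of 82.92%, a ROC-AUC of 0.9084, a precision of 83.56%, and a recall of 82.85%, showing strong classification performance. Training stabilized at 85% accuracy with minimal overfitting. The most common lineages (P.1, AY.99.2) and clades (GR, GK) were consistent with regional epidemiological trends, pointing to a potential association between viral genetics and clinical outcomes. Conclusion: The hybrid CNN-LSTM model successfully predicted COVID-19 severity from spike protein sequences and clinical data, underlining the usefulness of AI in genomic surveillance and precision public health. Despite its limitations, the method offers a framework for early prediction of disease severity in future outbreaks.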
URL
https://arxiv.org/abs/2505.23879