Abstract
Explainable artificial intelligence (XAI) approaches are increasingly applied in drug discovery to learn molecular representations and identify the substructures that drive property predictions. However, building end-to-end explainable machine learning models for structure-activity relationship (SAR) modeling of compound properties faces many challenges, such as limited activity data per target and the sensitivity of properties to subtle molecular changes. To address these challenges, we leveraged activity-cliff molecule pairs, i.e., compounds that share a common scaffold but differ sharply in potency, targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs 1O42, 2H8H, and 4MXO). We implemented graph neural network (GNN) methods to obtain atom-level feature information and predict compound-protein affinity (i.e., half-maximal inhibitory concentration, IC50). In addition, we trained the GNN models with different structure-aware loss functions to make fuller use of molecular property and structure information. We also used group lasso and sparse group lasso to prune and highlight molecular subgraphs, enhancing structure-specific model explainability for the predicted property difference within activity-cliff pairs. Integrating common and uncommon node information and applying sparse group lasso improved drug property prediction, reducing the average root mean squared error (RMSE) by 12.70% and achieving the lowest average RMSE of 0.2551 and the highest Pearson correlation coefficient (PCC) of 0.9572. Furthermore, applying regularization enhances feature attribution methods that estimate the contribution of each atom in the molecular graph, boosting global direction scores and atom-level coloring accuracy. This improves model interpretability in drug discovery pipelines, particularly when investigating important molecular substructures during lead optimization.
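To make the regularization mentioned above concrete, the following is a minimal sketch of a sparse group lasso penalty in its common textbook form. It is not taken from the paper: the function name, the `lam` and `alpha` parameters, the square-root group-size weighting, and the way weights are partitioned into groups are all illustrative assumptions. Setting `alpha=0` gives a plain group lasso penalty, and `alpha=1` gives an ordinary lasso penalty.

```python
import numpy as np


def sparse_group_lasso_penalty(weights, groups, lam=1.0, alpha=0.5):
    """Textbook sparse group lasso penalty (hypothetical helper, not the paper's code).

    penalty = lam * (alpha * ||w||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||w_g||_2)

    weights: 1-D array of model weights.
    groups:  list of index lists; each list selects the weights in one group
             (how groups map to atoms or features is an assumption here).
    lam:     overall regularization strength.
    alpha:   mixing between the elementwise L1 term and the groupwise L2 term.
    """
    weights = np.asarray(weights, dtype=float)

    # Elementwise L1 term: encourages sparsity inside groups.
    l1_term = np.abs(weights).sum()

    # Groupwise L2 term: encourages whole groups to shrink to zero.
    group_term = 0.0
    for idx in groups:
        w_g = weights[idx]
        group_term += np.sqrt(len(idx)) * np.linalg.norm(w_g)

    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)


# Example: two groups of two weights each; the second group is already zero.
example_weights = [3.0, 4.0, 0.0, 0.0]
example_groups = [[0, 1], [2, 3]]
example_penalty = sparse_group_lasso_penalty(example_weights, example_groups)
```

In a training loop, a term like this would typically be added to the prediction loss, so that groups contributing little to the predicted affinity are driven toward zero and the remaining groups stand out.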
URL
https://arxiv.org/abs/2507.03318