Abstract
This study introduces a systematic framework for comparing the efficacy of Large Language Models (LLMs) fine-tuned on various cheminformatics tasks. Employing a uniform training methodology, we assessed three well-known models (RoBERTa, BART, and LLaMA) on their ability to predict molecular properties using the Simplified Molecular Input Line Entry System (SMILES) as a universal molecular representation format. Our comparative analysis involved pre-training 18 configurations of these models, with varying parameter counts and dataset scales, and then fine-tuning them on six benchmark tasks from DeepChem. We maintained consistent training environments across models to ensure reliable comparisons, which allowed us to assess the influence of model type, size, and pre-training dataset size on performance. We found that LLaMA-based models generally achieved the lowest validation loss, suggesting superior adaptability across tasks and scales. However, we also observed that absolute validation loss is not a definitive indicator of model performance, contradicting previous findings, at least for fine-tuning tasks; instead, model size plays a crucial role. Through rigorous replication and validation, involving multiple training and fine-tuning cycles, our study not only delineates the strengths and limitations of each model type but also provides a robust methodology for selecting the most suitable LLM for specific cheminformatics applications. This research underscores the importance of considering model architecture and dataset characteristics when deploying AI for molecular property prediction, paving the way for more informed and effective use of AI in drug discovery and related fields.
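The key premise of the abstract is that SMILES is plain text, so molecules can be fed to an LLM like any other string. As an illustrative pure-Python sketch (not code from the paper, and deliberately avoiding the actual model stack), the helper below tokenizes a SMILES string into atom symbols, the kind of lexical structure a subword tokenizer would operate on:

```python
import re

# Illustrative only: a rough SMILES atom matcher, not a full SMILES parser.
# Order matters: bracketed atoms first, then two-letter symbols, then
# one-letter organic-subset symbols (lowercase = aromatic).
ATOM_PATTERN = re.compile(
    r"\[[^\]]+\]"          # bracketed atoms, e.g. [NH4+]
    r"|Cl|Br"              # two-letter organic-subset atoms
    r"|[BCNOSPFIbcnosp]"   # one-letter organic-subset atoms
)

def heavy_atom_count(smiles: str) -> int:
    """Count heavy (non-hydrogen) atom tokens in a SMILES string."""
    return len(ATOM_PATTERN.findall(smiles))

# Ethanol "CCO" has three heavy atoms; benzene "c1ccccc1" has six.
print(heavy_atom_count("CCO"), heavy_atom_count("c1ccccc1"))
```

A real pipeline in the spirit of the paper would instead pass raw SMILES strings to a pretrained tokenizer (e.g. RoBERTa's BPE tokenizer) and train on DeepChem task labels; this snippet only shows why no separate molecular featurization step is required.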
URL
https://arxiv.org/abs/2405.00949