Abstract
Recent years have seen the development of descriptor generation methods based on representation learning over extremely diverse molecules, especially methods that apply natural language processing (NLP) models to SMILES, a string notation for molecular structure. However, little research has examined how these models understand chemical structure. To address this, we investigated the relationship between the learning progress of SMILES and chemical structure using a representative NLP model, the Transformer. The results suggest that while the Transformer learns partial structures of molecules quickly, it requires extended training to understand overall structures. Consistently, the accuracy of molecular property predictions using descriptors generated by models at different training steps was similar from the beginning to the end of training. Furthermore, we found that the Transformer requires particularly long training to learn chirality and sometimes stagnates with low translation accuracy due to misunderstanding of enantiomers. These findings are expected to deepen understanding of NLP models in chemistry.
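To make the chirality point concrete: NLP-on-SMILES models typically operate on token sequences, and the SMILES strings of two enantiomers can differ in only a single bracket-atom token. The sketch below (an illustrative simplification, not the paper's actual pipeline; the regex covers only common SMILES tokens) shows this for the two enantiomers of alanine, suggesting why a token-level model may need long training to distinguish them.

```python
import re

# Regex-based SMILES tokenizer of the kind commonly used in NLP-on-SMILES
# pipelines; this pattern is a simplified illustration, not exhaustive.
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOFPSI]|[bcnops]|[=#\\/@+\-()1-9%])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)

# L- and D-alanine: enantiomers whose SMILES differ only inside one
# bracket-atom token ([C@H] vs. [C@@H]).
l_ala = "C[C@H](N)C(=O)O"
d_ala = "C[C@@H](N)C(=O)O"

tl, td = tokenize(l_ala), tokenize(d_ala)
diff = [(a, b) for a, b in zip(tl, td) if a != b]
print(diff)  # [('[C@H]', '[C@@H]')]
```

The two 11-token sequences are identical except at one position, so the overall structure the model must recover is nearly the same while the stereochemical meaning flips entirely.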
URL
https://arxiv.org/abs/2303.11593