Abstract
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases, but it cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED (Enzyme Chemistry Relation Extraction Dataset), a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods, such as (large) language models, that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 scores of 86.30% for NER, 86.66% for RE on chemical conversion pairs, and 83.79% for RE on chemical conversion pairs with their linked enzymes. We combine the best-performing methods after fine-tuning on EnzChemRED to create an end-to-end pipeline for knowledge extraction from text, and apply it to abstracts at PubMed scale to create a draft map of enzyme functions in the literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at this https URL.
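The F1 scores reported above are standard entity-level metrics: a prediction counts as correct only if both the span and the entity type match the gold annotation exactly. A minimal sketch of that computation follows; the example entities are illustrative and are not drawn from EnzChemRED itself.

```python
# Entity-level precision/recall/F1, as commonly used to score NER.
# Entities are (start, end, type) tuples; exact match is required.

def entity_f1(gold, pred):
    """Return the F1 score between gold and predicted entity sets."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: 2 of 3 predictions match the gold spans.
gold = {(0, 7, "Protein"), (20, 28, "Chemical"), (35, 42, "Chemical")}
pred = {(0, 7, "Protein"), (20, 28, "Chemical"), (50, 55, "Chemical")}
print(round(entity_f1(gold, pred), 4))  # precision = recall = 2/3
```

The same exact-match logic extends to RE scoring by treating each relation (e.g. a substrate/product pair, optionally with its linked enzyme) as the tuple to be matched.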
URL
https://arxiv.org/abs/2404.14209