EnzChemRED, a rich enzyme chemistry relation extraction dataset

2024-04-22 14:18:34
Po-Ting Lai, Elisabeth Coudert, Lucila Aimo, Kristian Axelsen, Lionel Breuza, Edouard de Castro, Marc Feuermann, Anne Morgat, Lucille Pourcel, Ivo Pedruzzi, Sylvain Poux, Nicole Redaschi, Catherine Rivoire, Anastasia Sveshnikova, Chih-Hsuan Wei, Robert Leaman, Ling Luo, Zhiyong Lu, Alan Bridge

Abstract

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 scores of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best-performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in the literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at this https URL.

Abstract (translated)

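Expert curation is essential for capturing knowledge of enzyme functions in FAIR open knowledgebases, but it cannot keep pace with new discoveries and new publications. In this work we present EnzChemRED (Enzyme Chemistry Relation Extraction Dataset), a training and benchmarking dataset to support the development of natural language processing (NLP) methods, such as (large) language models, that assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated with identifiers from the UniProt Knowledgebase (UniProtKB) and Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED significantly improves their ability to identify mentions of proteins and chemicals in text (named entity recognition, NER) and to extract the chemical conversions in which they participate (relation extraction, RE), with average F1 scores of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best-performing methods after fine-tuning on EnzChemRED into an end-to-end pipeline for knowledge extraction from text and apply it to abstracts at PubMed scale to create a draft map of enzyme functions in the literature, to guide curation work in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at this link: https://www.ncbi.nlm.nih.gov/25962541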

URL

https://arxiv.org/abs/2404.14209

PDF

https://arxiv.org/pdf/2404.14209.pdf

