Abstract
The oxygen reduction reaction (ORR) catalyst plays a critical role in fuel cell efficiency, making it a key focus of materials science research. However, extracting structured information about ORR catalysts from the vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature; the extracted information is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by annotating 12 critical entity types and two relation types between entity pairs. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to improve information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19%, while the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, a comparison with human annotators highlights the reliability of the fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable, automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models such as BlueBERT for ORR catalyst extraction.
URL
https://arxiv.org/abs/2507.07499