Abstract
Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without requiring domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and, through in-context learning, distilled knowledge from GPT-4. Specifically, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 covering four classes of endangered species; 2) human annotators verified the factual accuracy of the synthetic data, yielding gold data. The resulting dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, the constructed dataset was then used to fine-tune both general and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species in text.
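As a rough illustration of the dataset-construction step described above, the snippet below sketches how GPT-4-generated sentences could be post-processed into BIO-tagged NER examples before human verification. This is a minimal sketch under assumptions: the `bio_tag` helper, the whitespace tokenization, and the `SPECIES` label are illustrative, not the paper's actual implementation.

```python
# Hypothetical post-processing sketch: convert a synthetic sentence plus a
# list of known entity mentions into (token, BIO-tag) pairs for NER training.
# The helper name, tokenization, and label scheme are assumptions for
# illustration, not the paper's actual pipeline.

def bio_tag(sentence, entities, label="SPECIES"):
    """Whitespace-tokenize and assign B-/I-/O tags to known entity spans."""
    tokens = sentence.split()
    tags = ["O"] * len(tokens)
    for ent in entities:
        ent_toks = ent.split()
        # Scan for each occurrence of the entity's token sequence.
        for i in range(len(tokens) - len(ent_toks) + 1):
            if tokens[i:i + len(ent_toks)] == ent_toks:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(ent_toks)):
                    tags[j] = f"I-{label}"
    return list(zip(tokens, tags))

example = bio_tag("The Amur leopard is critically endangered.", ["Amur leopard"])
# example[1] is ("Amur", "B-SPECIES") and example[2] is ("leopard", "I-SPECIES")
```

Pairs produced this way could then be checked by human annotators (stage 2) and fed to a standard token-classification fine-tuning loop for BERT.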
URL
https://arxiv.org/abs/2403.15430