Abstract
In this study, we address one of the challenges of developing NER models for scholarly domains, namely the scarcity of suitable labeled data. We experiment with an approach that uses predictions from a fine-tuned LLM to aid non-domain experts in annotating scientific entities within astronomy literature, with the goal of uncovering whether such a collaborative process can approximate domain expertise. Our results reveal moderate agreement between a domain expert and the LLM-assisted non-experts, as well as fair agreement between the domain expert and the LLM's predictions. In an additional experiment, we compare the performance of fine-tuned and default LLMs on this task. We also introduce a specialized scientific entity annotation scheme for astronomy, validated by a domain expert. Our approach adopts a scholarly research contribution-centric perspective, focusing exclusively on scientific entities relevant to the research theme. The resulting dataset, containing 5,000 annotated astronomy article titles, is made publicly available.
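The abstract reports "moderate" and "fair" agreement without naming the statistic behind those words. The sketch below assumes they refer to Cohen's kappa, read with the commonly used Landis and Koch bands. That is an assumption, not something stated above. The function names and the toy labels are hypothetical and are not taken from the released dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch descriptive band."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical token-level labels for illustration only.
expert = ["Entity", "Entity", "O", "O"]
assisted = ["Entity", "O", "O", "O"]
kappa = cohen_kappa(expert, assisted)
print(kappa, landis_koch(kappa))
```

Under this reading, "moderate" would correspond to a kappa between 0.41 and 0.60 and "fair" to one between 0.21 and 0.40. The paper itself should be consulted for the measure actually used.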
URL
https://arxiv.org/abs/2405.02602