Abstract
Structured science summaries, or research contributions described using properties or dimensions beyond traditional keywords, enhance science findability. Current methods, such as those used by the Open Research Knowledge Graph (ORKG), involve manually curating properties to describe research papers' contributions in a structured manner, but this process is labor-intensive and inconsistent across domain-expert human curators. We propose using Large Language Models (LLMs) to automatically suggest these properties. However, it is essential to assess the readiness of LLMs such as GPT-3.5, Llama 2, and Mistral for this task before applying them. Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs. We evaluate LLM performance from four unique perspectives: semantic alignment with and deviation from ORKG properties, fine-grained property-mapping accuracy, SciNCL embeddings-based cosine similarity, and expert surveys comparing manual annotations with LLM outputs. These evaluations are conducted in a multidisciplinary science setting. Overall, LLMs show potential as recommendation systems for structuring science, but further fine-tuning is recommended to improve their alignment with scientific tasks and their mimicry of human expertise.
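One of the four evaluation perspectives is embeddings-based cosine similarity: the manually curated ORKG properties and the LLM-generated properties are each embedded (the paper uses SciNCL, a scientific-document embedding model), and their similarity is scored. A minimal sketch of that scoring step, using small hypothetical vectors in place of real SciNCL embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for SciNCL embeddings of a manually curated
# ORKG property list and an LLM-suggested property list; real SciNCL
# vectors are much higher-dimensional.
orkg_emb = np.array([0.2, 0.7, 0.1, 0.5])
llm_emb = np.array([0.25, 0.6, 0.2, 0.4])

print(round(cosine_similarity(orkg_emb, llm_emb), 3))  # close to 1.0 means high semantic agreement
```

A score near 1.0 suggests the LLM-suggested properties occupy roughly the same semantic region as the expert-curated ones, while lower scores flag divergence worth inspecting.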
URL
https://arxiv.org/abs/2405.02105