Abstract
We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG). Distinguishing itself from Automated Short Answer Grading (ASAG) and Automated Essay Grading (AEG), ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. To study ALAG, we introduce RiceChem, a dataset derived from a college chemistry course, featuring real student responses to long-answer questions with an average word count notably higher than in typical ASAG datasets. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference (NLI) models to verify whether each criterion, represented by a rubric item, is addressed in the student's response. This formulation enables effective transfer learning from MNLI, significantly improving model performance on the RiceChem dataset. We demonstrate the importance of the rubric-based formulation in ALAG, showing its superiority over traditional score-based approaches in capturing the nuances of student responses. We also investigate model performance in cold-start scenarios, providing valuable insights into practical deployment considerations in educational settings. Lastly, we benchmark state-of-the-art open-source Large Language Models (LLMs) on RiceChem and compare their results to GPT models, highlighting the increased complexity of ALAG compared to ASAG. Despite leveraging the benefits of a rubric-based approach and transfer learning from MNLI, the lower performance of LLMs on RiceChem underscores the significant difficulty posed by the ALAG task. With this work, we offer a fresh perspective on grading long, fact-based answers and introduce a new dataset to stimulate further research in this important area. Code: \url{this https URL}.
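The rubric entailment formulation described above can be sketched in a few lines: each rubric item becomes an entailment hypothesis checked independently against the student's response, and the score is the sum of the weights of satisfied items. The rubric items, weights, and the `nli_entails` stub below are all illustrative assumptions, not the paper's actual data or model; in practice the stub would be replaced by an MNLI-finetuned model's entailment prediction.

```python
import re

def nli_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for a real NLI model call (e.g., an MNLI-finetuned
    transformer). Here, a naive word-overlap heuristic stands in so the
    sketch runs without model weights."""
    prem_terms = set(re.findall(r"\w+", premise.lower()))
    hyp_terms = set(re.findall(r"\w+", hypothesis.lower()))
    return len(hyp_terms & prem_terms) / max(len(hyp_terms), 1) > 0.5

def grade(response: str, rubric: list[tuple[str, float]]) -> float:
    """Rubric entailment grading: credit is awarded per rubric item the
    response is judged to entail; the total is the earned score."""
    return sum(weight for item, weight in rubric if nli_entails(response, item))

# Hypothetical rubric and response for illustration only.
rubric = [
    ("the reaction is exothermic", 2.0),
    ("the entropy of the system decreases", 2.0),
]
response = "Because heat is released, the reaction is exothermic."
score = grade(response, rubric)  # only the first item is satisfied
```

Decomposing the grade this way yields per-criterion decisions rather than a single opaque score, which is what allows an off-the-shelf NLI dataset like MNLI to serve as a transfer-learning source.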
URL
https://arxiv.org/abs/2404.14316