Automated Long Answer Grading with RiceChem Dataset

2024-04-22 16:28:09
Shashank Sonkar, Kangqi Ni, Lesa Tran Lu, Kristi Kincaid, John S. Hutchinson, Richard G. Baraniuk

Abstract

We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG). Distinguishing itself from Automated Short Answer Grading (ASAG) and Automated Essay Grading (AEG), ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. To study ALAG, we introduce RiceChem, a dataset derived from a college chemistry course, featuring real student responses to long-answer questions with an average word count notably higher than typical ASAG datasets. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models to verify whether each criterion, represented by a rubric item, is addressed in the student's response. This formulation enables the effective use of MNLI for transfer learning, significantly improving the performance of models on the RiceChem dataset. We demonstrate the importance of rubric-based formulation in ALAG, showcasing its superiority over traditional score-based approaches in capturing the nuances of student responses. We also investigate the performance of models in cold start scenarios, providing valuable insights into the practical deployment considerations in educational settings. Lastly, we benchmark state-of-the-art open-sourced Large Language Models (LLMs) on RiceChem and compare their results to GPT models, highlighting the increased complexity of ALAG compared to ASAG. Despite leveraging the benefits of a rubric-based approach and transfer learning from MNLI, the lower performance of LLMs on RiceChem underscores the significant difficulty posed by the ALAG task. With this work, we offer a fresh perspective on grading long, fact-based answers and introduce a new dataset to stimulate further research in this important area. Code: \url{this https URL}.
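The rubric entailment formulation described in the abstract can be illustrated with a small sketch: the student response acts as the premise, each rubric item acts as a hypothesis, and the grade is the number of rubric items judged to be entailed. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The names `grade`, `total_score`, `RubricResult`, and `toy_entails` are hypothetical, and `toy_entails` is a crude word-overlap stand-in for a real natural language inference model (for example, one fine-tuned on MNLI), which would be plugged in through the same `entails` argument.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function (premise, hypothesis) -> True if entailed.
EntailmentFn = Callable[[str, str], bool]


@dataclass
class RubricResult:
    rubric_item: str
    satisfied: bool


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for an NLI model (assumption, not a real classifier).

    Returns True when every non-stopword of the hypothesis appears in the premise.
    """
    stopwords = {"the", "a", "an", "is", "are", "of", "to", "in", "that"}

    def tokenize(text: str) -> List[str]:
        cleaned = text.lower().replace(".", " ").replace(",", " ")
        return cleaned.split()

    premise_words = set(tokenize(premise))
    hypothesis_words = [w for w in tokenize(hypothesis) if w not in stopwords]
    return all(w in premise_words for w in hypothesis_words)


def grade(response: str, rubric: List[str], entails: EntailmentFn) -> List[RubricResult]:
    """Check each rubric item (hypothesis) against the student response (premise)."""
    return [RubricResult(item, entails(response, item)) for item in rubric]


def total_score(results: List[RubricResult]) -> int:
    """Score is the count of rubric items the response satisfies."""
    return sum(1 for r in results if r.satisfied)


if __name__ == "__main__":
    response = (
        "Nuclear charge increases across the period, "
        "so electrons are held more tightly."
    )
    rubric = ["nuclear charge increases", "shielding is constant"]
    results = grade(response, rubric, toy_entails)
    for r in results:
        print(f"{r.rubric_item!r}: {'met' if r.satisfied else 'not met'}")
    print("score:", total_score(results), "of", len(rubric))
```

Keeping the entailment check behind a function argument means the per-item decisions stay inspectable: each rubric item gets its own verdict, in contrast to a score-based approach that predicts only a single number for the whole response.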

URL

https://arxiv.org/abs/2404.14316

PDF

https://arxiv.org/pdf/2404.14316.pdf

