Abstract
Automated explanatory feedback systems play a crucial role in facilitating learning for large cohorts of learners by offering feedback that incorporates explanations, significantly enhancing the learning process. However, delivering such explanatory feedback in real time poses challenges, particularly when high classification accuracy for domain-specific, nuanced responses is essential. Our study leverages the capabilities of large language models, specifically Generative Pre-Trained Transformers (GPT), to explore a sequence labeling approach focused on identifying components of desired and less desired praise for providing explanatory feedback within a tutor training dataset. Our aim is to equip tutors with actionable, explanatory feedback during online training lessons. To investigate the potential of GPT models for providing explanatory feedback, we employed two commonly used approaches: prompting and fine-tuning. To quantify the quality of highlighted praise components identified by GPT models, we introduced a Modified Intersection over Union (M-IoU) score. Our findings demonstrate that: (1) the M-IoU score correlates well with human judgment in evaluating sequence quality; (2) two-shot prompting on GPT-3.5 achieved decent performance in recognizing effort-based (M-IoU of 0.46) and outcome-based praise (M-IoU of 0.68); and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.64 for effort-based praise and 0.84 for outcome-based praise, aligning with the satisfaction levels evaluated by human coders. Our results show promise for using GPT models to provide feedback that focuses on the specific elements of tutors' open-ended responses that are desirable or could use improvement.
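The abstract does not spell out the exact modification behind the M-IoU score. As a rough illustration only, a plain token-level intersection-over-union between a predicted highlight span and a human-annotated span, which M-IoU presumably adapts, can be sketched as follows (the function name and set-of-token-indices representation are assumptions, not the paper's definition):

```python
def span_iou(pred, gold):
    """Token-level intersection-over-union between two highlighted spans,
    each represented as a collection of token indices.

    Illustrative sketch only: the paper's M-IoU modifies standard IoU
    in ways not specified in the abstract.
    """
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        # Both annotator and model highlighted nothing: perfect agreement.
        return 1.0
    return len(pred & gold) / len(pred | gold)


# Example: model highlights tokens 2-5, human annotator highlights 3-6.
# Overlap is 3 tokens, union is 5 tokens, so IoU = 0.6.
print(span_iou(range(2, 6), range(3, 7)))
```

A score of 1.0 means the model's highlighted praise component matches the human annotation exactly; 0.0 means no overlap at all.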
URL
https://arxiv.org/abs/2405.00291