Abstract
PURPOSE OR GOAL: This study investigates how generative AI (GenAI) can be integrated with a criterion-referenced grading framework to improve the efficiency and quality of grading for mathematical assessments in engineering. It specifically explores the challenges demonstrators face with manual grading against model solutions, and how a GenAI-supported system can be designed to reliably identify student errors, provide high-quality feedback, and support human graders. The research also examines human graders' perceptions of the effectiveness of this GenAI-assisted approach.
ACTUAL OR ANTICIPATED OUTCOMES: The study found that GenAI achieved an overall grading accuracy of 92.5%, comparable to that of two experienced human graders. The two researchers, who also served as subject demonstrators, perceived the GenAI as a helpful second reviewer that improved accuracy by catching small errors and provided more complete feedback than they could manually. A central outcome was the significant enhancement of formative feedback. However, they noted that the GenAI tool is not yet reliable enough for autonomous use, especially with unconventional solutions.
CONCLUSIONS/RECOMMENDATIONS/SUMMARY: This study demonstrates that GenAI, when paired with a structured, criterion-referenced framework using binary questions, can grade engineering mathematical assessments with an accuracy comparable to that of human experts. Its primary contribution is a novel methodological approach that embeds the generation of high-quality, scalable formative feedback directly into the assessment workflow. Future work should investigate student perceptions of GenAI grading and feedback.
Abstract (translated)
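**Purpose or goal:** This study explores how generative AI (GenAI) can be combined with a criterion-referenced grading framework to improve the efficiency and quality of grading in engineering mathematics assessments. It focuses on the challenges demonstrators face when grading manually against model solutions, and on how a GenAI-supported system can be designed to reliably identify student errors, provide high-quality feedback, and assist human graders. It also examines human graders' views on the effectiveness of this GenAI-assisted approach.
**Actual or anticipated outcomes:** GenAI reached an overall grading accuracy of 92.5%, comparable to that of two experienced human graders. The two researchers, who also served as subject demonstrators, regarded GenAI as an effective second reviewer that improved accuracy and gave more complete feedback than they could produce by hand. A key outcome was a substantial improvement in formative feedback. The researchers also noted that the GenAI tool is not yet reliable enough for autonomous use, particularly when handling unconventional solutions.
**Conclusions/recommendations/summary:** This study shows that GenAI, when paired with a structured, criterion-referenced framework built on binary questions, can grade engineering mathematics assessments with an accuracy comparable to that of expert human graders. The study's main contribution is a novel methodological approach that embeds the generation of high-quality, scalable formative feedback directly into the assessment workflow. Future research should examine students' perceptions of GenAI-based grading and feedback.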
URL
https://arxiv.org/abs/2601.15626