Abstract
The Socratic method is a way of guiding students toward solving a problem independently without directly revealing the solution. Although this method has been shown to significantly improve student learning outcomes, it remains a complex, labor-intensive task for instructors. Large language models (LLMs) can be used to augment human effort by automatically generating Socratic questions for students. However, existing methods that involve prompting these LLMs sometimes produce invalid outputs, e.g., ones that directly reveal the solution to the problem or pose irrelevant or premature questions. To alleviate this problem, inspired by reinforcement learning with AI feedback (RLAIF), we first propose a data augmentation method to enrich existing Socratic questioning datasets with questions that are invalid in specific ways. Next, we propose a method to optimize open-source LLMs such as LLama 2 to prefer ground-truth questions over generated invalid ones, using direct preference optimization (DPO). Our experiments on a Socratic questioning dataset for student code debugging show that a DPO-optimized 7B LLama 2 model can effectively avoid generating invalid questions and, as a result, outperforms existing state-of-the-art prompting methods.
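As a rough sketch of the preference-optimization step described above, the snippet below computes the standard DPO objective (Rafailov et al., 2023) over pairs in which the ground-truth Socratic question is the preferred response and an augmented invalid question (solution-revealing, irrelevant, or premature) is the rejected one. The function name, the beta value, and the toy inputs are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over preference pairs.

    Each argument is a tensor of per-sequence log-probabilities,
    log pi(y | x) summed over response tokens. Here "chosen" would be the
    ground-truth Socratic question and "rejected" an augmented invalid one
    (assumed pairing, following the abstract's description).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of the preferred
    # (ground-truth) and rejected (invalid) questions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with dummy per-sequence log-probabilities for three pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0]),
                torch.tensor([-14.0, -10.0, -13.5]),
                torch.tensor([-12.5, -9.8, -11.2]),
                torch.tensor([-13.0, -10.1, -12.8]))
print(loss.item())
```

In practice this loss would be minimized over Llama 2's outputs on the augmented preference dataset, with the reference model held fixed; the training harness itself is not shown here.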
URL
https://arxiv.org/abs/2403.00199