Abstract
Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous: relying on human expertise for context-specific guidance is resource-intensive and time-consuming. Prior work has used predefined sets of rules or principles to steer model behavior (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making them hard to adapt to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework that requires minimal or no human effort and is designed to automatically generate guiding principles in real time for each input query, then use them to align each response. We evaluate SPRI on three tasks and show that 1) SPRI can derive principles in a complex domain-specific task that lead to performance on par with expert-crafted ones; 2) SPRI-generated principles yield instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvements in truthfulness. We release our code and model generations at this https URL.
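The abstract describes a per-query pipeline: derive situated principles for each input, then condition the response on them. Below is a minimal sketch of that idea, assuming a generic chat-completion callable `complete`; the function name `spri_respond`, the prompt wording, and the critique-and-refine pass are illustrative assumptions, not the paper's actual prompts or algorithm.

```python
from typing import Callable

def spri_respond(query: str, complete: Callable[[str], str]) -> str:
    """Sketch: generate instance-specific principles for `query`,
    then draft and refine a response against those principles."""
    # Stage 1: derive guiding principles situated to this specific query.
    principles = complete(
        "Write 3-5 concise principles a good response to the following "
        f"query should satisfy.\n\nQuery: {query}"
    )
    # Stage 2: draft a response conditioned on those principles.
    draft = complete(
        f"Principles:\n{principles}\n\nQuery: {query}\n\n"
        "Answer the query while adhering to the principles above."
    )
    # Stage 3 (assumed): critique the draft against the same principles
    # and rewrite it to resolve any violations.
    critique = complete(
        f"Principles:\n{principles}\n\nResponse:\n{draft}\n\n"
        "List any ways the response violates the principles."
    )
    return complete(
        f"Principles:\n{principles}\n\nDraft:\n{draft}\n\n"
        f"Critique:\n{critique}\n\n"
        "Rewrite the draft so it fully satisfies the principles."
    )
```

Passing the completion function in as a parameter keeps the sketch backend-agnostic: any model client that maps a prompt string to a response string can be plugged in.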
URL
https://arxiv.org/abs/2502.03397