
SPRI: Aligning Large Language Models with Context-Situated Principles

2025-02-05 17:32:29
Hongli Zhan, Muneeza Azmat, Raya Horesh, Junyi Jessy Li, Mikhail Yurochkin

Abstract

Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous, since depending on human expertise for context-specific guidance is resource-intensive and time-consuming. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that automatically generates guiding principles in real time for each input query and uses them to align each response. We evaluate SPRI on three tasks and show that 1) SPRI can derive principles for a complex domain-specific task that lead to performance on par with expert-crafted principles; 2) SPRI-generated principles yield instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvements in truthfulness. We release our code and model generations at this https URL.
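
To make the abstract's per-query idea concrete, the sketch below shows the two-step pattern of first eliciting principles situated to a specific input and then conditioning the response on them. This is a minimal illustration assuming an OpenAI-compatible chat API; the prompts, model name, and helper functions (`generate_principles`, `respond_with_principles`) are hypothetical and are not taken from the authors' released implementation.

```python
# Minimal sketch of context-situated principles: generate principles tailored
# to one query, then answer while conditioned on them. Illustrative only.
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # placeholder model name

def generate_principles(query: str) -> str:
    """Ask the model for a short list of principles situated to this query."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You write concise guiding principles."},
            {"role": "user", "content": (
                "List 3-5 principles a helpful, truthful assistant should follow "
                f"when answering this specific request:\n\n{query}"
            )},
        ],
    )
    return resp.choices[0].message.content

def respond_with_principles(query: str, principles: str) -> str:
    """Generate the final answer conditioned on the situated principles."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": f"Follow these principles:\n{principles}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    query = "My elderly father refuses to take his medication. What should I do?"
    principles = generate_principles(query)
    print("Principles:\n", principles)
    print("Response:\n", respond_with_principles(query, principles))
```

The same per-instance principles could, as the abstract notes, also serve as rubrics for an LLM judge or as guidance when producing synthetic SFT data, but those uses are not shown here.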


URL

https://arxiv.org/abs/2502.03397

PDF

https://arxiv.org/pdf/2502.03397.pdf

