Paper Reading AI Learner

Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

2024-04-22 16:20:36
Jan-Philipp Fränken, Eric Zelikman, Rafael Rafailov, Kanishk Gandhi, Tobias Gerstenberg, Noah D. Goodman

Abstract

When prompting a language model (LM), users frequently expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles into a model can be resource-intensive and technically challenging, generally requiring human preference labels or examples. We introduce SAMI, a method for teaching a pretrained LM to follow behavioral principles that does not require any preference labels or demonstrations. SAMI is an iterative algorithm that finetunes a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a "principle writer" model; to avoid dependence on stronger models, we further evaluate aligning a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct). The SAMI-trained mixtral-8x7b outperforms both the initial model and the instruction-finetuned model, achieving a 65% win rate on summarization. Our results indicate that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
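
The objective described above, finetuning a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries, is typically estimated contrastively over a batch of matched and mismatched pairs. The sketch below is a minimal, hypothetical illustration of such an InfoNCE-style objective, assuming a matrix `logp` whose [i, j] entry holds the log-likelihood of response i under constitution j for query i; the function name and batching details are illustrative and not taken from the paper.

```python
# Hypothetical sketch of a contrastive (InfoNCE-style) objective whose
# minimization increases a lower bound on the conditional mutual information
# I(constitution; response | query). `logp[i, j]` is assumed to hold
# log p_theta(response_i | constitution_j, query_i), summed over the tokens
# of response_i. Names and details are illustrative, not the paper's exact loss.
import torch
import torch.nn.functional as F

def contrastive_mi_loss(logp: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss; matched pairs lie on the diagonal."""
    n = logp.size(0)
    targets = torch.arange(n, device=logp.device)
    # Each response scored against every constitution in the batch (rows).
    row_loss = F.cross_entropy(logp, targets)
    # Each constitution scored against every response in the batch (columns).
    col_loss = F.cross_entropy(logp.t(), targets)
    # Driving this loss down raises log p(y_i | c_i, x_i) relative to the
    # mismatched pairs, which tightens the mutual-information bound.
    return 0.5 * (row_loss + col_loss)

if __name__ == "__main__":
    # Toy check with random "log-likelihoods" for a batch of 4 pairs.
    fake_logp = torch.randn(4, 4)
    print(contrastive_mi_loss(fake_logp))
```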

Abstract (translated)

在提示语言模型(LM)时,用户通常期望模型在各种任务中遵循一系列行为原则,例如在生成有洞察力的内容的同时避免有害或带有偏见的语言。向模型注入这样的原则可能既耗费资源又具有技术挑战性,通常需要人类偏好标签或示范。我们提出了SAMI,一种无需任何偏好标签或示范即可教会预训练LM遵循行为原则的方法。SAMI是一种迭代算法,它通过微调预训练LM,在给定数据集查询的条件下增大constitution(章程)与模型自生成回复之间的条件互信息。在单轮对话和摘要任务上,经过SAMI训练的mistral-7b优于初始预训练模型,胜率介于66%到77%之间。令人惊讶的是,它在单轮对话上还以55%到57%的胜率超过了指令微调基线(mistral-7b-instruct)。SAMI需要一个"原则编写者"模型;为了避免依赖更强的模型,我们进一步评估了使用较弱的指令微调模型(mistral-7b-instruct)编写的constitution来对齐一个较强的预训练模型(mixtral-8x7b)。经过SAMI训练的mixtral-8x7b在摘要任务上超过了初始模型和指令微调模型,取得了65%的胜率。我们的结果表明,预训练LM无需偏好标签、示范或人类监督即可学会遵循constitution。

URL

https://arxiv.org/abs/2404.14313

PDF

https://arxiv.org/pdf/2404.14313.pdf

