Aligning language models with human preferences

2024-04-18 12:55:18
Tomasz Korbak

Abstract

Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching, but distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.
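The Bayesian-inference framing in the abstract can be made concrete on a toy example. The sketch below is not taken from the thesis. It assumes a four-outcome "language model" with a hypothetical prior, a hypothetical reward vector, and a temperature `beta`, and it assumes the evidence about preferences enters as a likelihood proportional to exp(reward / beta). Under those assumptions, the posterior is the prior reweighted by that likelihood, and it coincides with the maximizer of the KL-regularized reward objective commonly used in RLHF. The code checks this numerically against random candidate policies.

```python
import math
import random


def posterior(prior, reward, beta):
    """Return p*(x) proportional to prior(x) * exp(reward(x) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(prior, reward)]
    z = sum(weights)
    return [w / z for w in weights]


def kl_regularized_objective(policy, prior, reward, beta):
    """Compute E_policy[reward] - beta * KL(policy || prior)."""
    expected_reward = sum(q * r for q, r in zip(policy, reward))
    kl = sum(q * math.log(q / p) for q, p in zip(policy, prior) if q > 0)
    return expected_reward - beta * kl


# Hypothetical toy values, chosen only for illustration.
prior = [0.4, 0.3, 0.2, 0.1]    # pretrained LM over four possible outputs
reward = [-1.0, 0.0, 0.5, 1.0]  # scoring function standing in for preferences
beta = 0.5                      # strength of the KL penalty

target = posterior(prior, reward, beta)
best = kl_regularized_objective(target, prior, reward, beta)

# No randomly drawn policy should beat the posterior on the objective.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() for _ in prior]
    total = sum(raw)
    candidate = [v / total for v in raw]
    assert kl_regularized_objective(candidate, prior, reward, beta) <= best + 1e-12
```

Smaller values of `beta` move the target further from the prior toward high-reward outputs, while larger values keep it close to the pretrained distribution.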

URL

https://arxiv.org/abs/2404.12150

PDF

https://arxiv.org/pdf/2404.12150.pdf

