Abstract
Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as summarizing text, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content or falsehoods, or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (the base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching, but that distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.
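To make the Bayesian-inference view and the RLHF-versus-distribution-matching relation concrete, the following is a minimal sketch in standard notation; the symbols π_0 (pretrained prior), r (reward encoding human preferences), β (KL coefficient) and p* (target distribution) are illustrative and not drawn from the thesis. KL-regularized RLHF maximizes expected reward under a KL penalty toward the pretrained prior, and its optimum is that prior conditioned, in the Bayesian sense, on the preference evidence encoded by the reward; distribution matching instead minimizes a divergence to an arbitrary target distribution, of which this exponentially tilted posterior is one special case.

% KL-regularized RLHF objective and its optimum (standard formulation, illustrative notation):
\[
  J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\!\left(\pi \,\|\, \pi_0\right)
  \qquad\Longrightarrow\qquad
  \pi^{*}(x) \;\propto\; \pi_0(x)\,\exp\!\big(r(x)/\beta\big).
\]
% Distribution matching minimizes a divergence to an arbitrary target p*,
% e.g. the forward KL below; the RLHF optimum above is the special case
% where p* takes the exponentially tilted form:
\[
  \min_{\theta}\; \mathrm{KL}\!\left(p^{*} \,\|\, \pi_{\theta}\right).
\]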
Abstract (translated)
Language models (LMs), trained on large amounts of text data, can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also exhibit behaviors that violate human preferences, e.g., generating offensive content, falsehoods or perpetuating social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be viewed as Bayesian inference: conditioning a prior (the base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in many ways. In Chapter 3, I investigate two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be viewed as a special case of distribution matching, but that distribution matching is more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.
URL
https://arxiv.org/abs/2404.12150