Abstract
Improving language model generations according to user-defined quality or style constraints is challenging. Typical approaches include training on additional human-written data, filtering "low-quality" data using heuristics, and/or using reinforcement learning with human feedback (RLHF). However, filtering can remove valuable training signal, whereas data collection and RLHF constantly require additional human-written or LM exploration data, which can be costly to obtain. A natural question to ask is: "Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?" To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients to learn language generation tasks as a 1-step RL game. LoL-RL can fine-tune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments on five different language generation tasks, using models of varying sizes and multiple rewards, show that models trained with LoL-RL can consistently outperform the best supervised learning models. We also release our experimental code. this https URL
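The core idea of treating generation as a 1-step RL game with offline policy gradients can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the function name and the baseline handling are illustrative assumptions. Each offline sequence is treated as a single action, and its log-likelihood under the model is weighted by a reward-derived advantage, as in REINFORCE:

```python
def one_step_pg_loss(seq_logprobs, rewards, baseline=0.0):
    """Sketch of a 1-step offline policy-gradient loss (hypothetical helper).

    seq_logprobs: per-example log-likelihoods of offline sequences under
                  the current model (summed over tokens).
    rewards:      per-example utility scores, e.g. from a classifier.
    baseline:     scalar subtracted from rewards for variance reduction.
    Returns the mean advantage-weighted negative log-likelihood, which,
    when minimized, increases the likelihood of high-reward sequences.
    """
    advantages = [r - baseline for r in rewards]
    losses = [-a * lp for a, lp in zip(advantages, seq_logprobs)]
    return sum(losses) / len(losses)
```

With `baseline=0` and all rewards equal to 1, this reduces to the ordinary supervised (maximum-likelihood) loss, which is why such objectives can be applied directly to existing crowd-sourced data without fresh exploration.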
URL
https://arxiv.org/abs/2305.14718