Abstract
Large language models excel at a variety of language tasks when prompted with examples or instructions. Yet controlling these models through prompting alone is limited. Tailoring language models through fine-tuning (e.g., via reinforcement learning) can be effective, but it is expensive and requires model access. We propose Inference-time Policy Adapters (IPA), which efficiently tailor a language model such as GPT-3 without fine-tuning it. IPA guides a large base model during decoding time through a lightweight policy adapter trained to optimize an arbitrary user objective with reinforcement learning. On five challenging text generation tasks, such as toxicity reduction and open-domain generation, IPA consistently brings significant improvements over off-the-shelf language models. It outperforms competitive baseline methods, sometimes even those based on expensive fine-tuning. In particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major performance boost over GPT-3 (and sometimes even over GPT-4). Our promising results highlight the potential of IPA as a lightweight alternative to tailoring extreme-scale language models.
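The abstract describes guiding a frozen base model at decoding time with a small RL-trained adapter policy. Below is a minimal sketch of what such a decoding step could look like, assuming the two policies are combined as a product of distributions over the vocabulary; the function name `ipa_decode_step` and the toy logits are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def ipa_decode_step(base_logits: torch.Tensor,
                    adapter_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Sample one token from the combination of a frozen base model's
    next-token distribution and a lightweight adapter's distribution.

    Both inputs have shape (vocab_size,). Combining them as a product
    of experts lets the adapter reweight the base model's proposals
    without ever touching the base model's parameters.
    """
    base_logp = F.log_softmax(base_logits / temperature, dim=-1)
    adapter_logp = F.log_softmax(adapter_logits / temperature, dim=-1)
    combined_logp = base_logp + adapter_logp           # product in log space
    combined_probs = F.softmax(combined_logp, dim=-1)  # renormalize
    return torch.multinomial(combined_probs, num_samples=1)

# Toy usage: random logits stand in for a large, possibly API-only base
# model and a small adapter policy trained with reinforcement learning.
vocab_size = 50257
base_logits = torch.randn(vocab_size)     # frozen base model (e.g., GPT-3)
adapter_logits = torch.randn(vocab_size)  # lightweight RL-trained adapter
next_token = ipa_decode_step(base_logits, adapter_logits)
```

Under this setup, only the adapter's parameters would be updated during RL training, which is what makes the approach cheap and applicable to models accessible only through an API.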
URL
https://arxiv.org/abs/2305.15065