Abstract
Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., one trained to align with human preference) in hand: can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be interpolated between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, and thus the stronger model can be obtained directly by extrapolating from the weights of the two relatively weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to match and even surpass the fully trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.
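The extrapolation idea above can be sketched in a few lines. This is an illustrative assumption of the general recipe, not the authors' implementation: given the weights of the weaker SFT model and the medium-aligned model, form `aligned + alpha * (aligned - sft)` for some extrapolation factor `alpha`. The function name and the use of plain per-parameter dictionaries are hypothetical simplifications.

```python
# Hypothetical sketch of ExPO-style weight extrapolation.
# Models are represented as name -> weight dicts of scalars for clarity;
# in practice each entry would be a full tensor (e.g., from a state_dict).

def extrapolate_weights(sft, aligned, alpha=0.5):
    """Extrapolate past `aligned` along the direction (aligned - sft).

    alpha = 0 returns the aligned model unchanged; larger alpha moves
    further along the SFT -> aligned direction. The factor alpha is a
    hyperparameter to be tuned on held-out data.
    """
    return {
        name: aligned[name] + alpha * (aligned[name] - sft[name])
        for name in aligned
    }

# Toy example with a single scalar "parameter":
sft_model = {"w": 1.0}
aligned_model = {"w": 2.0}
stronger = extrapolate_weights(sft_model, aligned_model, alpha=0.5)
# stronger["w"] == 2.0 + 0.5 * (2.0 - 1.0) == 2.5
```

The key point is that no gradient computation or training data is involved: the stronger model is produced purely by arithmetic on the two existing checkpoints.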
Abstract (translated)
Although large language models (LLMs) can ideally scale their capabilities with increasing data and compute, in reality they are constrained by limited resources. Suppose we have a moderately trained LLM in hand (e.g., one trained to align with human preference): can we further exploit its potential and obtain a stronger model at low cost? In this paper, we propose a simple method called ExPO to improve LLMs' alignment with human preference. ExPO assumes that a medium-aligned model lies between a less-aligned (weaker) model and a better-aligned (stronger) one, so the stronger model can be obtained directly by extrapolating from the weights of the two weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to match and even surpass the fully trained model, without any additional training. Moreover, ExPO also significantly improves off-the-shelf DPO/RLHF models and shows good scalability across model sizes from 7B to 70B. Our work demonstrates the effectiveness of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction for future exploration.
URL
https://arxiv.org/abs/2404.16792