Abstract
Numerous capability and safety techniques of Large Language Models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be cast as sampling from an unnormalized target distribution defined by a given reward or potential function over the full sequence. In this work, we leverage the rich toolkit of Sequential Monte Carlo (SMC) for these probabilistic inference problems. In particular, we use learned twist functions to estimate the expected future value of the potential at each timestep, which enables us to focus inference-time computation on promising partial sequences. We propose a novel contrastive method for learning the twist functions, and establish connections with the rich literature of soft reinforcement learning. As a complementary application of our twisted SMC framework, we present methods for evaluating the accuracy of language model inference techniques using novel bidirectional SMC bounds on the log partition function. These bounds can be used to estimate the KL divergence between the inference and target distributions in both directions. We apply our inference evaluation techniques to show that twisted SMC is effective for sampling undesirable outputs from a pretrained model (a useful component of harmlessness training and automated red-teaming), generating reviews with varied sentiment, and performing infilling tasks.
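To make the core idea concrete, here is a minimal sketch of twisted SMC on a toy problem. Everything specific below is an illustrative assumption, not the paper's setup: the "language model" is a uniform Bernoulli model over 0/1 tokens, the potential is `phi(x) = exp(BETA * sum(x))`, and the twist functions are computed analytically as the expected future value of the potential given a partial sequence (in the paper they are learned). Particles propose tokens from the base model, are reweighted by incremental twist ratios, and are resampled so that computation concentrates on promising partial sequences; the running product of mean weights estimates the log partition function.

```python
import math
import random

random.seed(0)

T = 8       # sequence length
K = 256     # number of SMC particles
BETA = 1.0  # strength of the toy potential

# Toy stand-in for a language model: each "token" is 0 or 1, drawn uniformly.
# Target distribution: sigma(x) ∝ p(x) * phi(x) with phi(x) = exp(BETA * sum(x)),
# so the target up-weights full sequences containing many 1s.

def twist(prefix_sum, t):
    # Exact twist psi_t(x_{1:t}) = E_p[phi(x_{1:T}) | x_{1:t}]:
    # the expected future value of the potential given the partial sequence.
    return math.exp(BETA * prefix_sum) * ((1 + math.exp(BETA)) / 2) ** (T - t)

def twisted_smc():
    particles = [0] * K  # each particle tracks the sum of its prefix
    log_z = 0.0          # running estimate of log E_p[phi], the log partition function
    for t in range(1, T + 1):
        proposals, weights = [], []
        for s in particles:
            x = random.randint(0, 1)  # propose the next token from the base model
            prev = twist(s, t - 1) if t > 1 else 1.0
            weights.append(twist(s + x, t) / prev)  # incremental twist weight
            proposals.append(s + x)
        log_z += math.log(sum(weights) / K)
        # Resample: concentrate computation on promising partial sequences.
        particles = random.choices(proposals, weights=weights, k=K)
    return log_z

log_z_true = T * math.log((1 + math.exp(BETA)) / 2)  # analytic log E_p[phi]
print(f"SMC estimate: {twisted_smc():.3f}  true: {log_z_true:.3f}")
```

Because this toy target factorizes, the twists are exact and the SMC estimate lands close to the analytic value; with learned (approximate) twists, the gap between such estimates from below and above is what the paper's bidirectional bounds use to assess inference quality.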
URL
https://arxiv.org/abs/2404.17546