Abstract
Inference-time computation is a powerful paradigm for enhancing the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict, mid-generation, the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether to generate more samples, prune unpromising samples early on, or pick the best sample. This capability is very inexpensive, as it requires generating only a single predefined token. Trained on a dataset constructed from real, unfiltered LMSYS user prompts, Llama 3.1 8B improves its win rate against GPT-4 on AlpacaEval from 21% to 34% with 16 samples, and its math performance on GSM8K from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
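To make the mechanism concrete, below is a minimal sketch of how such a single-token self-evaluation could drive adaptive Best-of-N sampling, assuming a HuggingFace-style causal LM interface. The evaluation prompt wording, the choice of "Yes" as the predefined token, the 0.5 threshold, and the model checkpoint are illustrative assumptions, not the paper's exact recipe; temperature annealing is omitted for brevity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; the paper fine-tunes Llama 3.1 8B for self-evaluation.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Assumed: a single predefined token whose probability encodes
# P(restarting generation yields a better response).
EVAL_TOKEN_ID = tokenizer.convert_tokens_to_ids("Yes")

@torch.no_grad()
def restart_probability(prompt: str, response: str) -> float:
    """Score a (possibly partial) generation with one extra forward pass.

    Appends a self-evaluation query and reads off the probability of a
    single predefined token -- the one-token cost described in the
    abstract. The query wording here is an illustrative guess.
    """
    text = (
        f"{prompt}\n{response}\n"
        "Would restarting the generation yield a better response?"
    )
    inputs = tokenizer(text, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]           # next-token logits
    probs = torch.softmax(logits, dim=-1)
    return probs[EVAL_TOKEN_ID].item()

@torch.no_grad()
def adaptive_best_of_n(prompt: str, max_samples: int = 16,
                       threshold: float = 0.5) -> str:
    """Keep sampling only while the model predicts a restart would help."""
    best, best_score = None, 1.0
    for _ in range(max_samples):
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=256, do_sample=True)
        response = tokenizer.decode(
            out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        p_restart = restart_probability(prompt, response)
        if p_restart < best_score:    # lower restart prob = better sample
            best, best_score = response, p_restart
        if p_restart < threshold:     # model judges a restart unlikely to help
            break
    return best
```

The same `restart_probability` score, queried mid-generation rather than after a full sample, could also serve as the pruning signal the abstract describes: samples whose predicted restart probability is high early on are abandoned before their full token budget is spent.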
URL
https://arxiv.org/abs/2410.02725