Paper Reading AI Learner

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

2024-07-31 17:57:25
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini

Abstract

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43%, which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
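The "coverage" metric above is commonly called pass@k: the probability that at least one of k sampled attempts solves a problem. A minimal sketch of the standard unbiased estimator from the Codex evaluation methodology (whether the authors compute coverage with this exact estimator is an assumption):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of coverage (pass@k): the probability that
    at least one of k attempts, drawn without replacement from n
    generated samples of which c are correct, solves the problem.
    Computed as 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Every size-k subset of the n samples must contain a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With k = 1 this reduces to the plain per-sample success rate c / n.
print(pass_at_k(10, 5, 1))   # 0.5
print(pass_at_k(4, 2, 2))    # ≈ 0.833
```

Averaging this quantity over all problems in a benchmark gives the benchmark-level coverage curve as a function of k, which the abstract reports is often log-linear.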


URL

https://arxiv.org/abs/2407.21787

PDF

https://arxiv.org/pdf/2407.21787.pdf
