
LIMO: Less is More for Reasoning

2025-02-05 17:23:45
Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu

Abstract

We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom holds that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, up from 6.5% and 59.2% respectively for previous SFT-based models, while using only 1% of the training data those approaches require. LIMO also demonstrates exceptional out-of-distribution generalization, achieving a 40.5% absolute improvement across 10 diverse benchmarks and outperforming models trained on 100x more data, which challenges the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): in foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation from pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to apply its knowledge base to complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at this https URL.
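The recipe behind these numbers is plain supervised fine-tuning, just on a very small, carefully curated set of (problem, reasoning-trace) pairs. The sketch below illustrates the idea with Hugging Face TRL; the dataset id, base model, column names, and hyperparameters are illustrative assumptions on my part, not the authors' released training setup.

    # Minimal SFT sketch of the "less is more" recipe: fine-tune a strong base
    # model on a few hundred curated reasoning traces. The dataset/model ids,
    # the "question"/"solution" column names, and all hyperparameters are
    # illustrative assumptions, not the paper's exact configuration.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset("GAIR/LIMO", split="train")  # ~817 curated samples (assumed id)

    def to_chat(example):
        # Turn each (problem, long chain-of-thought solution) pair into the
        # chat format that SFTTrainer consumes directly.
        return {
            "messages": [
                {"role": "user", "content": example["question"]},
                {"role": "assistant", "content": example["solution"]},
            ]
        }

    dataset = dataset.map(to_chat, remove_columns=dataset.column_names)

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-32B-Instruct",  # assumed base model for illustration
        train_dataset=dataset,
        args=SFTConfig(
            output_dir="limo-sft",
            num_train_epochs=3,              # illustrative; small data tolerates more epochs
            per_device_train_batch_size=1,
            gradient_accumulation_steps=8,
            learning_rate=1e-5,
            bf16=True,
        ),
    )
    trainer.train()

If the abstract's claim holds, nothing in this procedure is exotic: the reported gains would come from the curation of the 817 "cognitive template" examples, not from the training loop itself.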


URL

https://arxiv.org/abs/2502.03387

PDF

https://arxiv.org/pdf/2502.03387.pdf

