Specializing Smaller Language Models towards Multi-Step Reasoning

2023-01-30 08:51:19
Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot

Abstract

The surprising ability of Large Language Models (LLMs) to perform well on complex reasoning with only few-shot chain-of-thought prompts is believed to emerge only in very large-scale models (100+ billion parameters). We show that such abilities can, in fact, be distilled from GPT-3.5 ($\ge$ 175B) down to T5 variants ($\le$ 11B). We propose model specialization: concentrating a model's ability on a target task. The hypothesis is that large models (commonly viewed as larger than 100B) have strong modeling power, but that power is spread across a large spectrum of tasks. Small models (commonly viewed as smaller than 10B) have limited capacity, but if we concentrate that capacity on a specific target task, they can achieve a decent performance improvement. We use multi-step math reasoning as our testbed because it is a very typical emergent ability. We show two important aspects of model abilities: (1) there exists a complex balance/tradeoff between language models' multi-dimensional abilities; (2) by paying the price of decreased generic ability, we can clearly lift the scaling curve of models smaller than 10B towards specialized multi-step math reasoning. We further give a comprehensive discussion of important design choices for better generalization, including the tuning data format, the starting model checkpoint, and a new model selection method. We hope our practice and discoveries can serve as an important attempt towards specialized smaller models in the new research paradigm set by LLMs.
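The specialization recipe the abstract describes, fine-tuning a small seq2seq model on chain-of-thought rationales distilled from a large teacher, reduces to a standard supervised objective. Below is a minimal sketch in Python using Hugging Face Transformers; the model name, the question/rationale pair, and the hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: specialize a small T5 model on teacher-distilled
# chain-of-thought (CoT) data. Assumes CoT pairs were already sampled
# from a large teacher such as GPT-3.5; the example pair below is made up.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # stand-in for any T5 variant <= 11B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One distilled training pair: a math question as the input, and the
# teacher's step-by-step rationale ending in the answer as the target.
question = "Natalia sold 48 clips in April and half as many in May. How many in total?"
rationale = "April: 48 clips. May: 48 / 2 = 24 clips. 48 + 24 = 72. The answer is 72."

inputs = tokenizer(question, return_tensors="pt", truncation=True)
labels = tokenizer(rationale, return_tensors="pt", truncation=True).input_ids

# Standard seq2seq cross-entropy on the rationale tokens -- one step of
# the fine-tuning loop, repeated over the whole distilled dataset.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Run over the full distilled dataset, this is the tradeoff the abstract names: the small model pays with generic ability to move up the specialized multi-step math reasoning scaling curve.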

URL

https://arxiv.org/abs/2301.12726

PDF

https://arxiv.org/pdf/2301.12726.pdf

